[kwlug-disc] Load average record?
Khalid Baheyeldin
kb at 2bits.com
Fri Jan 19 12:30:21 EST 2024
On Fri, Jan 19, 2024 at 12:07 PM Znoteer via kwlug-disc
<kwlug-disc at kwlug.org> wrote:
> On Fri, Jan 19, 2024 at 11:51:25AM -0500, Khalid Baheyeldin wrote:
> > Craziest thing I have seen in a long time ...
> >
> > Here is the uptime output:
> >
> > Uptime: 09:50:01 up 92 days, 23:22, 6 users, load average: 4382.02,
> > 4354.61, 3862.00
>
> Wow. I once had my desktop go in the 20's and 30's for load average and I thought
> I must have hit gold. If I'd hit gold, you've gone well past titanium :)
More like: rusted out iron ...
I found those values from alert emails that I ignored during the crisis.
The disk that failed was one of a pair of HDDs in a software mdadm RAID1 that
were used for daily backups. But since the disk had plenty of free space, and
no load during the day, it was also used for Varnish cache. This is a busy web
site, so there was very heavy traffic on the disk, and that is what caused the
99% I/O wait.
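For anyone wondering how a load average can reach four digits on a box with idle CPUs: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so every thread stuck waiting on a dead disk adds to it. A quick sketch of how to see that:

```shell
# 1/5/15-minute load averages, runnable/total tasks, last PID
cat /proc/loadavg

# How many tasks are currently in uninterruptible sleep (D state);
# grep -c exits non-zero when the count is 0, hence the || true
ps -eo state= | grep -c '^D' || true
```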
Once I identified the problem, I switched Varnish cache to be in RAM, and the
site came back up.
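For reference, moving the Varnish cache off disk boils down to changing the storage backend from file to malloc when starting varnishd. The sizes and paths below are made up for illustration, not what was actually running:

```shell
# Before (hypothetical): file-backed storage on the failing array
#   varnishd -a :6081 -f /etc/varnish/default.vcl -s file,/backup/varnish/cache.bin,100G

# After: malloc storage, cache lives entirely in RAM (size is hypothetical)
varnishd -a :6081 -f /etc/varnish/default.vcl -s malloc,4G
```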
There was another saga two days after when the data centre replaced the disk
but left the server in rescue mode for some reason. Lousy job by OVH support.
I am disappointed in mdadm. If one disk fails, shouldn't it just be ignored
and I/O continue on the good one?
And because the array was busy, I could not remove the disk from the
array either.
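For anyone hitting the same situation, the usual incantation for kicking a dead member out of an array looks like this (device names are hypothetical); in a case like the above, the remove step can keep failing while the kernel still has I/O queued against the array:

```shell
# Mark the bad member as failed, then remove it from the array
mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1

# Check array state; [U_] in the output means one member is missing
cat /proc/mdstat
```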
Even after Varnish was restarted with RAM storage (so it no longer touched the
failed array), there was still a backup process (a shell script that runs tar)
whose current working directory was /backup, which is on the array.
That process was in UNINTERRUPTIBLE state (D) and could not be killed (not by
kill -TERM, -HUP, nor -9). And there were errors logged every few minutes about
failed reads/writes, and the log kept growing.
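A quick way to spot processes stuck like that, and to see what kernel function they are blocked in (D-state tasks ignore all signals, including -9, until the blocking I/O completes):

```shell
# List processes in uninterruptible sleep, with the kernel wait channel
# (wchan) showing where each one is blocked; NR==1 keeps the header row
ps -eo state,pid,wchan:30,cmd | awk 'NR==1 || $1 ~ /^D/'
```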
End of rant ...