[kwlug-disc] New Linux PC Build Advice

Cedric Puddy cedric at ccj.host
Wed Oct 25 14:15:33 EDT 2023


I can understand the curiosity!  I only root-caused it to a certain point.

Initially, I took the system mostly at its word when it said <disk x> failed; but since I was having multiple "disk failures", I was thinking “that’s weird”.

But I replaced them anyway, and rather than immediately RMAing the “failed” disks, I stress tested them with diagnostic software I had (PC-Check - a proprietary system test suite, often used by OEMs for automated burn-in testing, great comprehensive package; I suspect it’s long gone now).

Unable to replicate the slightest issue.

The system had 3x dual-port disk expansion cards (dual-port EIDE!  god these guys were cheap!) plus the motherboard ports, 10 disks in all, so some channels had two disks.  I recall them as Hitachi 400 GB disks, around 3 TB usable, and able to hold down about 400-500 MB/sec, which allowed it to keep up nicely with its 4x GigE ports (running NFSv2) - 4x GigE is roughly 4 x 125 MB/sec = 500 MB/sec theoretical, so the disks could just about saturate the network.  (Lots of Sun e450 and Linux clients.  Those e450s had giant power supplies, could hold 20 drives, all nice as you please… and here I’m hacking this EIDE spaghetti thing together while /looking right at them/.  In retrospect, it seems like it should have been trivial to sell them on going that way; I can’t explain it!)

Anyway, it’d idle well enough, but sometimes when it was under IO load, you’d just see a random IO error pop up on a random device in dmesg, and md would go “oh god, failed array!”, and then you’d have to attempt a rebuild… which pushed the IOPS into orbit, and sometimes it’d succeed, other times you’d lose the array to a second glitch.
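
For the record, digging out of one of those spurious failures looked roughly like this (a sketch from memory; the md and disk device names here are hypothetical):

    # See what md thinks happened and which member it kicked out
    cat /proc/mdstat
    mdadm --detail /dev/md0

    # Re-add the "failed" (but actually healthy) disk and let it rebuild
    mdadm --manage /dev/md0 --re-add /dev/sdc1

    # If a second glitch killed the array outright, force reassembly from
    # the surviving superblocks and hope for the best
    mdadm --assemble --force /dev/md0 /dev/sd[a-j]1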

I suspect that bus contention/issues might have been the key; I got it down to one disk per channel and optimized the cabling.  In my memory, I’m not certain what the fix ultimately was (intermittent issues with production systems can be like that), but it ultimately ran mostly stable.  It lasted 3-4 years, but damn that first year was rough.

I’ve heard tell of there being glitchiness with some controllers; perhaps electrical noise on those olde schoole ribbon cables, perhaps the power supply was sagging a little too much at peak load, perhaps there were some issues with md that have since been sorted.  I don’t think it was heat: there was a lot going on in that case, but I made sure there were fans, and monitored temps.
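
If I were chasing that box today, I’d at least rule the drives in or out with their SMART data first; UDMA CRC errors in particular tend to point at cabling or noise rather than the platters (device name hypothetical):

    # Full SMART report: reallocated sectors, UDMA CRC error count
    # (a classic cable/noise symptom), and drive temperature
    smartctl -a /dev/sda

    # Kick off the drive's own extended self-test
    smartctl -t long /dev/sda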

I’ve certainly lost md arrays because of unexpected shutdowns, so I love redundancy on md devices, and I strongly favour true h/w RAID cards, because it’s my feeling that this is a good space in which to black-box all the IO complexity of RAID and put a nice NVRAM cache on all the writes.
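
On the md side, a write-intent bitmap goes a long way toward surviving a dirty shutdown without a full resync; a rough sketch (not my original build, and the device names are made up):

    # RAID 6 across the spindles, with an internal write-intent bitmap so an
    # unexpected shutdown only resyncs the stripes that were in flight
    mdadm --create /dev/md0 --level=6 --raid-devices=10 \
          --bitmap=internal /dev/sd[a-j]1

    # Or add a bitmap to an existing array after the fact
    mdadm --grow /dev/md0 --bitmap=internal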

I do love me some ZFS, and I’m currently running my home NAS on UnRaid.  I’ve had fantastic luck with PERC and LSI cards flashed to IT mode (disables all the RAID for extra IO ops), and Backblaze has definitely proven that you can run truly huge numbers of devices via fan-out controllers.
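
For what it’s worth, the modern equivalent of that old EIDE box is nearly a one-liner with ZFS on an IT-mode HBA; a sketch with made-up pool and device names:

    # raidz2 tolerates two simultaneous drive failures; ashift=12 for 4K sectors
    zpool create -o ashift=12 tank raidz2 /dev/sd[a-j]

    # Periodic scrubs catch silent corruption before a resilver is on the line
    zpool scrub tank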

Thanks for the trip down memory lane!  Hope it was interesting,

  -Cedric


│ CCj/ClearLine - Hosting and TCP/IP Network Services since 1997
├──────────────────────────────
│ Cedric Puddy, IS Director, cedric at ccj.host, 519-489-0478x102

> On Oct 24, 2023, at 16:28, Chris Frey <cdfrey at foursquare.net> wrote:
> 
> On Tue, Oct 24, 2023 at 03:10:15PM -0400, Cedric Puddy wrote:
>> It’s not as much these days with NVME and such, and the risks are
>> way lower with only two devices in the set.  I had a client insist on
>> me building a multi-spindle, md based, NFS server that used multiple
>> disk controller cards, back in the day.  It was 10 spindles RAID 5, and
>> I must have rebuilt that array 5 times in the course of year, without a
>> single actual drive failure.  It was not a good design, and definitely a
>> false economy; one of them there learnin’ experiences for all concerned.
> 
> What caused the failures, since it wasn't the disks?  Controller?
> Software?
> 
> - Chris
