[kwlug-disc] NVME failure?

Mikalai Birukou mb at 3nsoft.com
Sun Aug 1 13:00:20 EDT 2021


> Short version: what is the longevity of NVME disks under heavy writes,
> e.g. for MySQL database?
>
> I am hoping that some hardware knowledgeable folk would clue me in on
> this hardware related issue.

I'll dump my own observations of SATA SSD disks. Maybe it has to do 
with the flash/controllers? NVMe speeds will allow heavier write 
pressure than SATA, though.


The first set of observations is about a 447GiB KINGSTON SA400S3. It is 
used as the system drive in my office machine. The system partition, and 
home with all compilations and work, are on this drive.

In March, it didn't start on the first try from a powered-off state. It 
looked like there was no disk, but that was a cursory observation (night, 
different time zone). The disk was attached to an ASRock server 
motherboard with an i3. The following attempt(s?) to boot went OK, 
leaving only a scary memory. No SMART errors were observed in the GUI 
Disks app in Ubuntu.
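
For next time, a rough sketch of pulling the same SMART info without the 
GUI, done in Python around smartctl from smartmontools; /dev/sda here is 
only a placeholder for the actual device:

#!/usr/bin/env python3
# Sketch: print smartctl's health verdict and attribute table.
# Assumes smartmontools is installed; /dev/sda is a placeholder device.
import subprocess

def smart_report(device="/dev/sda"):
    # -H prints the overall health assessment, -A the attribute table.
    for flag in ("-H", "-A"):
        result = subprocess.run(["smartctl", flag, device],
                                capture_output=True, text=True)
        print(result.stdout)

if __name__ == "__main__":
    smart_report()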

Just two weeks ago the same thing repeated, but the n-th attempt didn't 
help it boot. I booted from another drive: the Kingston drive was 
visible, I ran all the fsck's, and again no errors were reported. The 
following attempt to boot succeeded, and I am still continuing to boot 
and work daily from this drive (overlapping now with the retro 
motherboard :) ), and I have been doing rsync to a zfs pool on other 
drives -- paranoia is indeed a requirement.
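
The rsync part is nothing fancy; roughly this, with placeholder paths 
for home and the zfs-backed target:

#!/usr/bin/env python3
# Rough sketch of the rsync-to-backup-pool idea; both paths are placeholders.
import subprocess

SRC = "/home/"                      # data living on the suspect SSD
DST = "/tank/backups/office-home/"  # dataset on the zfs pool (made-up path)

# -a preserves permissions/times, -x stays on one filesystem,
# --delete makes the copy mirror deletions too.
subprocess.run(["rsync", "-ax", "--delete", SRC, DST], check=True)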


The second set of observations is about a 120GB Kingston drive.

An LSI card (non-HBA), set up as a mirror, used two SSD drives. A few 
months in, the drive was marked as disappeared. After a power cycle the 
card saw the drive again, and in the BIOS-time GUI (circa 2010's) I told 
it to use the disk again, with a clean result. It's a GitLab runner box, 
i.e. a desktop-like level of load. The same scenario repeated a couple of 
months later.

During reshuffling, I think this same Kingston drive ended up as one of 
three in a zfs mirror, passed through as a single drive from the non-HBA 
card. It disappeared within a year. The disk is still somewhere around.
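
In hindsight, something like the sketch below, run from cron, would have 
flagged the missing mirror member sooner. It only assumes the zfs tools 
are on PATH; nothing in it is specific to that box:

#!/usr/bin/env python3
# Sketch: exit non-zero when zpool reports anything other than healthy pools.
import subprocess
import sys

# 'zpool status -x' prints "all pools are healthy" when nothing is wrong.
result = subprocess.run(["zpool", "status", "-x"],
                        capture_output=True, text=True)
print(result.stdout)
if "all pools are healthy" not in result.stdout:
    sys.exit(1)  # non-zero exit so whatever wraps this can raise an alert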


Sorry for the less detailed description. I'll remember to look at dmesg 
next time.
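
Something along these lines would at least capture the interesting dmesg 
lines; the keyword list is only my guess at what to watch for:

#!/usr/bin/env python3
# Sketch: scan dmesg output for lines that look like disk trouble.
# Needs enough privilege to read the kernel log; keywords are a guess.
import subprocess

KEYWORDS = ("nvme", "ata", "i/o error", "probe failure")

result = subprocess.run(["dmesg"], capture_output=True, text=True)
for line in result.stdout.splitlines():
    if any(k in line.lower() for k in KEYWORDS):
        print(line)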

> Basically, I had a client who got a new server over a year ago. The
> hosting company, a large US based host, recommended that we use an
> NVME disk for MySQL. This is a plain old physical server running
> Ubuntu Server 20.04 LTS (no VMs, no docker). It has 64GB RAM and 16
> cores. The root file system is two SSDs that are software RAIDed
> together.
>
> We had kernel patches accumulating over that year, and I wanted to do
> a reboot to make sure that everything started normally. Upon rebooting
>
> [  128.001364] nvme nvme0: Device not ready; aborting initialisation
> [  128.002041] nvme nvme0: Removing after probe failure status: -19
>
> That NVME was a 1.6TB Micron 9200 MAX, if that matters.
>
> There was no device file under /dev/ for that disk anymore.
>
> After the host replaced the NVME, everything was normal, as below:
> [    7.558183] nvme nvme0: Shutdown timeout set to 10 seconds
> [    7.562576] nvme nvme0: 32/0/0 default/read/poll queues
> [    7.565741]  nvme0n1: p1
> ...
> Jul 31 15:27:54 live multipath: nvme0n1: failed to get udev uid:
> Invalid argument
> Jul 31 15:27:54 live multipath: nvme0n1: uid =
> eui.000000000000000100a0750128df8715 (sysfs)
> ...
> [    6.008941] nvme nvme0: pci function 0000:b6:00.0
>
> [   11.571864] EXT4-fs (nvme0n1p1): mounted filesystem with ordered
> data mode. Opts: (null)
>
> Now to the questions:
>
> - Why would a device be functional before a reboot but totally go away
> after, and not being even detected by the operating system?
> - Are NVME as unreliable as SSDs or better? Or are they just faster?
>
> All thoughts appreciated ...
> --
> Khalid M. Baheyeldin
> 2bits.com, Inc.
> Fast Reliable Drupal
> Drupal performance optimization, hosting and consulting.
> "Sooner or later, this combustible mixture of ignorance and power is
> going to blow up in our faces." -- Dr. Carl Sagan




