[kwlug-disc] Old Man Yells at ZFS (was: Docker Host Appliance)

Chris Irwin chris at chrisirwin.ca
Tue Jan 17 01:19:15 EST 2023


This is kind of derailing into "Old man yells at Cloud" territory here, 
so I changed the subject line.

Also, I just want to re-iterate: I don't hate ZFS, though I am annoyed 
by some of it. I am actually using it on the NAS, and may stick with it, 
at least for now, because the pros outweigh the cons (and despite Linus' 
official stance of "Don't use it").

   https://www.realworldtech.com/forum/?threadid=189711&curpostid=189841

Gripes aside, I'm not actually trying to start a filesystem fight 
(otherwise I'd say "FAT16 ought to be enough for anybody" and run away).

On Mon, Jan 16, 2023 at 04:23:27PM -0800, Ronald Barnes wrote:
>Chris Irwin via kwlug-disc wrote on 16/01/2023 14.45:
>
>>But I'm a home user. And a cheap one, at that. I have a bunch of
>>data and a bunch of disks, and I want to be reasonably sure both the
>>data (and backups) are valid.
>
>Then ZFS is the only option - only it detects (and can *correct*)
>corrupt data (if I understand correctly).

Btrfs also does automatic error detection and correction.
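For example, a periodic scrub re-reads all data and metadata, verifies
checksums, and repairs bad copies wherever a redundant copy exists. It's
something like this (the mount point is just a placeholder):

   btrfs scrub start /mnt/data
   btrfs scrub status /mnt/data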

Arguably you could count bcachefs as well, but it's still out-of-tree, 
and I haven't seen any reliability analysis on it. I'm not trusting it 
with the family photos just yet.

>From listening to them, as I understand it, you'd take the 2 new disks,
>make a vdev from them, then add that vdev to your pool and storage space
>has expanded.

Yeah, that is an option. Two new drives would be limited to mirroring 
each other instead of expanding the existing raid-z vdev, which reduces 
the amount of usable storage gained. It also means adding drives in at 
least pairs (i.e., I can't go from 4 to 5 drives).
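If I went that route, it'd be something along the lines of (pool name 
and device paths are placeholders):

   zpool add tank mirror /dev/sdx /dev/sdy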

Honestly, though, we shouldn't have to watch a podcast, or study vdev 
calculators to figure out how to add drives effectively. We should be 
able to say "Here's two new drives" and have the existing raid-z expand 
to use them. We have this with mdadm, we have this with lvm, we have 
this with btrfs. It should be in zfs.
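With btrfs or mdadm, growing onto a new drive is roughly (device names 
and mount points are placeholders):

   # btrfs: add the device, then rebalance across all members
   btrfs device add /dev/sdx /mnt/data
   btrfs balance start /mnt/data

   # mdadm: add a spare, then grow the array onto it
   mdadm --add /dev/md0 /dev/sdx
   mdadm --grow /dev/md0 --raid-devices=5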

>>I've been using mdadm (+lvm) and btrfs for a lot of years,
>I use mdadm + lvm myself, but only through inertia. Adding btrfs to that
>is never gonna happen; as Doug pointed out, it's not reliable.

BTRFS is the default in Fedora Workstation, and SUSE offers commercial 
support for it.

The only caveat: don't use the parity RAID modes (RAID5/6).

>And with layers upon layers (ext4 on lvm on mdadm), it still doesn't
>achieve the features of ZFS at a single layer.

LVM (itself, and its snapshots) serves a different purpose, and uses a 
very different mechanism, than ZFS or BTRFS.

LVM snapshots are not designed to be long-lived. They cause write 
amplification just by existing. The idea behind LVM snapshots was to 
snapshot the state of a filesystem, use filesystem or generic tools 
(dump/rsync/etc.) to stream it to tape/disk/cloud in a consistent state, 
then delete the snapshot. (Although "snapshot, [dangerous task], 
merge/rollback" is possible with LVM, too.)

You can do long-lived LVM snapshots if you're using thin-provisioned 
LVM, which also helps mitigate some of the write amplification issues.  
Other than Red Hat's Stratis, thin LVM doesn't seem to be seeing much 
active interest or development (which is probably fine).

(That said, thin-provisioned LVM is pretty neat, and I wish it had 
existed years ago.)
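For the curious, a minimal thin-provisioning sketch (volume group, 
names, and sizes are made up):

   # carve out a thin pool, then thin volumes and snapshots inside it
   lvcreate --type thin-pool -L 100G -n pool0 vg0
   lvcreate --thin -V 500G -n data vg0/pool0
   lvcreate --snapshot --name data-snap vg0/data   # thin snapshots need no size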

BTRFS & ZFS snapshots cause no write penalty for existing, because data 
is never written in-place anyway. So snapshots are "free" in terms of 
performance, and can therefore be long-lived. While there is no 
additional write amplification, you do potentially suffer from file 
fragmentation over time (basically a non-issue on solid-state, but maybe 
annoying on HDDs, depending on the environment).
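Creating one is a one-liner either way (pool, dataset, and path names 
are placeholders):

   zfs snapshot tank/data@2023-01-17
   btrfs subvolume snapshot -r /mnt/data /mnt/data/.snapshots/2023-01-17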

(Also, annoyingly, LVM and BTRFS/ZFS are all referred to as "COW", even 
though they do very different things with very different impacts, and 
only LVM actually does the "C" (copy) on "W" (write). Don't get me 
started on qcow2...)

>Even rsync falls far short of ZFS's ability to detect a single changed
>block in a 1TB file and backup only that one block.

Yeah, filesystems like ZFS and BTRFS that break layer boundaries are 
somewhat necessary to get that most efficient "minimum difference 
between snapshot A and snapshot B" backup. Any after-the-fact comparison 
tool, like rsync, will never be able to compete in terms of speed or 
efficiency.
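The incremental send/receive tools are where that pays off; roughly 
(hostnames, dataset names, and snapshot names are placeholders):

   # zfs: stream only the blocks changed between two snapshots
   zfs send -i tank/data@monday tank/data@tuesday | ssh backuphost zfs receive backup/data

   # btrfs: send the difference relative to a parent snapshot
   btrfs send -p /snaps/monday /snaps/tuesday | ssh backuphost btrfs receive /backup/snaps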

>>if you wanted. WIth BTRFS if I had four disks and one failed (and I 
>>have enough free space), I could rebalance the array to use one fewer
>>drive and recover a measure of redundancy while waiting for 
>>stock/sale/shipping/payday.
>
>Sounds like a handy feature.
>
>But again, with btrfs RAID5/6 *should not be used in production*.

Agreed, never use the parity modes in BTRFS.

However, if you missed that warning when you created your BTRFS 
filesystem and want to fix it, you can rebalance from RAID5 to RAID1 and 
be comfortably safe again. No need to reboot or even umount.

If you have a plethora of drives and want extra redundancy? Rebalance to 
RAID1c3, and you have three copies of everything.
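Both conversions are a single (long-running, but fully online) balance; 
the mount point is a placeholder:

   # convert data and metadata from raid5 to raid1
   btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/data

   # or, with enough drives, keep three copies of everything
   btrfs balance start -dconvert=raid1c3 -mconvert=raid1c3 /mnt/data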

If you decide you used the wrong layout for your zfs vdev, you need to 
destroy it and recreate it.

>>ZFS also can't fully utilize mismatched disks, apparently. My
>>4-drive array has 2x6TB and 2x8TB drives, which means there's 2x2TB
>>worth of unusable space on the 8TB drives. This worked fine with
>>btrfs.
>
>I believe you could have 1 vdev of 6TB drives and 1 vdev of the
>8TB drives together in a pool without losing the 2GB.

Correct, although using two mirror vdevs to reclaim that extra 2x2TB of 
disk ends up with less usable space overall than just using raid-z and 
losing the 2x2TB. The pool isn't at capacity yet, so currently it's just 
annoying. I'll be at the angry-muttering stage if these disks fill up, 
though.
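Rough arithmetic, ignoring overhead:

   2x6TB + 2x8TB as raid-z1:     usable ~ 3 x 6TB = 18TB (2x2TB wasted)
   2x6TB + 2x8TB as two mirrors: usable ~ 6TB + 8TB = 14TB (nothing wasted, but less space)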

-- 
Chris Irwin

email:   chris at chrisirwin.ca
  xmpp:   chris at chrisirwin.ca
   web: https://chrisirwin.ca



