[kwlug-disc] Docker Host Appliance

Chris Irwin chris at chrisirwin.ca
Thu Jan 19 02:38:00 EST 2023


On Tue, Jan 17, 2023 at 03:45:20PM -0500, Doug Moen wrote:
>The advice for tuning OpenZFS performance for database and VM workloads 
>is here:
>
>https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#virtual-machines
>
>First, you create a "dedicated dataset" (dbs) or you store the VM image 
>in a "zvol or raw file to avoid overhead".

What "overhead" are they attempting to avoid by using raw files or 
zvols? Because there are potential downsides to those as well.

If it's just a recommendation for best-case performance, fine. But 
generally, using a proper virtual disk format that allows your 
hypervisor to take snapshots is also important.
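To make the trade-off concrete, here's a hypothetical sketch of both approaches (pool, dataset, and path names are made up):

```shell
# Option 1: a ZFS zvol exposed to the hypervisor as a block device.
# Less per-write overhead, but snapshots happen at the ZFS layer,
# not inside the hypervisor.
zfs create -V 10G -o volblocksize=16K tank/vms/guest1

# Option 2: a qcow2 image file, which lets the hypervisor (QEMU here)
# take its own internal snapshots independent of the filesystem.
qemu-img create -f qcow2 /tank/vms/guest1.qcow2 10G
qemu-img snapshot -c pre-upgrade /tank/vms/guest1.qcow2
```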

>Then you tune the blocksize of the dedicated DB or VM storage to match 
>the blocksize of the database instance or the VM guest filesystem (to 
>avoid overhead from partial record modification).
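For reference, that tuning boils down to something like this (dataset names are made up; the block sizes are the common defaults for these workloads):

```shell
# PostgreSQL writes 8K pages, so match the dataset recordsize:
zfs create -o recordsize=8K tank/db/postgres

# For a VM image stored as a file, match the guest filesystem's
# block size instead (ext4 defaults to 4K):
zfs create -o recordsize=4K tank/vms
```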

Anyway, the fact that there are ZFS docs for tuning VM storage for 
performance doesn't answer this earlier question:

>> People claim ZFS handles databases/VMs better than btrfs, but I don't 
>> really see how since it appears to use the same COW semantics. 

People specifically suggest disabling COW for VMs on BTRFS because your 
files will become fragmented. That's the reason.

This is technically true, but maybe should be weighed similarly to 
"mount filesystems with noatime because SSDs only have finite writes".
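For completeness, the usual BTRFS advice looks something like this (the path is just an example):

```shell
# Disable COW on the VM image directory *before* creating any images;
# new files created in the directory inherit the attribute.
mkdir -p /var/lib/libvirt/images
chattr +C /var/lib/libvirt/images

# Verify: a capital 'C' appears in the directory's flag list.
lsattr -d /var/lib/libvirt/images
```

Note that `+C` has no effect on files that already contain data, which is why it's applied to the directory up front.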

From what I understand, ZFS uses similar COW semantics to BTRFS, 
which means writes are never done in-place. Writes always land 
elsewhere, and the file is updated to reference the new blocks.

Here's an example:

Assume you're starting with a 10GB contiguous, non-fragmented, raw VM 
file. And for argument's sake, assume you have at least one snapshot of 
this dataset or zvol (because of course you do, you're using ZFS.  
They're free).

If you start making changes to that file (system upgrades in the VM, 
etc), those writes are written elsewhere on the ZFS filesystem, and the 
file is updated to reference the new data blocks. We know the original 
contents are intact, because we can inspect them via the snapshot.

Now fast-forward a few years of this VM doing its thing, plus years' 
worth of rolling hourly/daily/weekly/monthly snapshots.

Does this not cause fragmentation on the storage? Your live file will be 
nowhere near contiguous on the physical disk.

If "No": Please explain how ZFS avoids this. Because I haven't seen it 
discussed.

If "Yes": This is the same fragmentation issue that people warn about 
with BTRFS, causing them to say it can't do VMs, or you should disable 
COW, etc.
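A toy model makes the intuition easy to check. This is not real ZFS or BTRFS code, just a sketch of COW allocation: a "disk" hands out physical block addresses sequentially, a file is a map from logical block to physical block, and overwrites never reuse a block that a snapshot still references.

```python
# Toy model of copy-on-write allocation (not real ZFS/BTRFS code).

def cow_overwrite(file_map, logical_blocks, next_free):
    """Redirect each overwritten logical block to a fresh physical block."""
    for lb in logical_blocks:
        file_map[lb] = next_free
        next_free += 1
    return next_free

# Start with a perfectly contiguous 10-block file.
live = {i: i for i in range(10)}
snapshot = dict(live)  # the snapshot pins the original blocks
next_free = 10

# "Years" of scattered small writes, a few logical blocks at a time.
for batch in ([2, 3], [7], [0, 1], [5, 6, 7]):
    next_free = cow_overwrite(live, batch, next_free)

def contiguous_runs(mapping):
    """Count runs of physically adjacent blocks in logical order."""
    phys = [mapping[i] for i in sorted(mapping)]
    return 1 + sum(1 for a, b in zip(phys, phys[1:]) if b != a + 1)

# The snapshot is still one contiguous run; the live file is not.
print(contiguous_runs(snapshot))  # 1
print(contiguous_runs(live))      # 5
```

Even this handful of writes turns one contiguous extent into five, and nothing in the model is BTRFS-specific.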

I'm not saying fragmentation of storage is the end of the world. 
Read-ahead exists, caches exist. I'm just confused why BTRFS has a 
reputation for doing VMs poorly when, as far as I can tell, ZFS has the 
same behaviour.

The only reason I can think of that this wouldn't be an issue on ZFS 
is that it is so massively, massively RAM-hungry that it just 
brute-forces the problem with very aggressive read-ahead and caching 
(the ARC). BTRFS, on the other hand, relies on the kernel's built-in 
page cache.
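If anyone wants to check on their own system, something like this should show the two numbers I'm hand-waving about on OpenZFS-on-Linux (caveat: pool-level FRAG measures free-space fragmentation, not file layout, but it's the closest built-in metric):

```shell
# Current ARC size in MiB, from the standard kstat location:
awk '$1 == "size" {print $3/1048576 " MiB"}' /proc/spl/kstat/zfs/arcstats

# Free-space fragmentation per pool:
zpool list -o name,fragmentation
```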

-- 
Chris Irwin

email:   chris at chrisirwin.ca
  xmpp:   chris at chrisirwin.ca
   web: https://chrisirwin.ca



