[kwlug-disc] Hardlinks, or Improving OfflineIMAP storage of GMail

Chris Irwin chris at chrisirwin.ca
Thu Mar 25 14:00:42 EDT 2010


I figured I'd post a tip to the list about the "hardlink" utility.
Its use is of course not limited to mail, but this has been my
particular use of it so far.

GMail has a deduplication feature that means that a single copy of a
message can exist in two "folders" at once, rather than each being a
separate copy of the message. This is noteworthy when you consider
that my message to this mailing list will exist in both my "Sent Mail"
and my "KWLug" folders (as the *same* message). Multiply that by the
number of lists I'm on, the number of times I've sent myself mail,
etc. Since IMAP can't account for this, these messages are actually
downloaded and stored multiple times. Plus, GMail keeps a copy of
*every* message you have in a folder called "All Mail". So effectively
take your mailbox size and double its on-disk footprint (actually
more so: deleted messages can still exist in "All Mail", but that is
another topic).

Luckily there is a utility called "hardlink" that does exactly what it
sounds like. It will scan a provided list of directories and hard-link
duplicate files.
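
To illustrate what "hard-linking duplicate files" means at the
filesystem level, here is a minimal sketch using only standard tools
(mktemp, cmp, ln, and GNU stat) rather than hardlink itself. The scratch
directory and the file names are made up for the demonstration, and the
real utility's comparison logic is certainly more involved than a
single cmp call.

```shell
# Create two separate files with identical content in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/Sent" "$demo/KWLug"
printf 'Subject: test\n\nSame message body.\n' > "$demo/Sent/msg1"
cp "$demo/Sent/msg1" "$demo/KWLug/msg1"

# Before linking: two different inodes, each with a link count of 1.
inode_before=$(stat -c %i "$demo/KWLug/msg1")
stat -c '%i %h %n' "$demo/Sent/msg1" "$demo/KWLug/msg1"

# If the contents are byte-for-byte identical, replace the second file
# with a hard link to the first.
if cmp -s "$demo/Sent/msg1" "$demo/KWLug/msg1"; then
    ln -f "$demo/Sent/msg1" "$demo/KWLug/msg1"
fi

# After linking: both names share one inode, with a link count of 2.
stat -c '%i %h %n' "$demo/Sent/msg1" "$demo/KWLug/msg1"
```

The data is now stored once and reachable under both names. hardlink
automates this compare-and-link step across a whole directory tree.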

$ hardlink .maildir-chrisirwin --dry-run -tvx "(cmeta|index)"

-t ignores differing timestamps
-x is an exclude regex. I'm excluding some files that Evolution scatters around.
-v is verbose
--dry-run reports what would be linked without changing anything. Remove it to actually do the work.

After running this on my maildir, I saved 386 MiB, about 1/3 of the
initial size. Of course, repeat runs only save you a few KiB on
average (how big is the average mail?), so I probably only really need
to run it every week or so, maybe before my backups.

Mode:     real
Files:    40045
Linked:   14894 files
Compared: 313760 files
Saved:    386.03 MiB
Duration: 27.60 seconds

Here is my actual disk usage, and again with the linked files
double-counted (I didn't think to measure it before linking).

$ du -hs .maildir-chrisirwin/
624M	.maildir-chrisirwin/

$ du -hsl .maildir-chrisirwin/
1.1G	.maildir-chrisirwin/
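
The difference between those two figures comes down to how du treats
hard links: by default it counts each inode once, while -l
(--count-links) counts every name separately. A minimal sketch on a
throwaway directory, assuming GNU coreutils; the file names and the
1 MiB size are arbitrary choices for the demonstration.

```shell
demo=$(mktemp -d)
# One 1 MiB file of random data, plus a second name for the same inode.
head -c 1048576 /dev/urandom > "$demo/message"
ln "$demo/message" "$demo/message-copy"

# Default: the shared inode is counted once.
once=$(du -sk "$demo" | cut -f1)
# With -l: each hard link is counted as if it were a separate file.
twice=$(du -skl "$demo" | cut -f1)

echo "du -sk:  ${once} KiB"
echo "du -skl: ${twice} KiB"
```

The gap between the two numbers is roughly the space the links are
saving.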

This is useful if you're working with limited disk space (such as a
small netbook or smaller SSD). Unfortunately it doesn't save you from
having to download each message twice, and each copy still has to be
stored somewhere on the first sync; this is purely a corrective action
to free up the disk space afterwards.

Also, while I haven't had any issues thus far, and I can't imagine
what issues this could cause, your mileage may vary.

-- 
Chris Irwin
<chris at chrisirwin.ca>



