MWJ has responded to my last post, Don't Be a ZFS Hater, with a post of their own: You don't have to hate ZFS to know it's wrong for you.
I don't like the point-by-point quote and response format — it's way too much like an old-school Usenet flamewar. So I will simply try to hit the high points of their arguments.
Where we agree
ZFS is not ready to deploy to the entire Mac OS X user base today. There's still some work to be done.
ZFS isn't necessary for most of today's Macintosh computers. If you have been using your Mac with no storage-related problems, then you can keep on using it that way. Perform regular backups and you'll be just fine.
It would be an absolutely terrible idea to take people's perfectly working HFS+ installations on existing computers and forcibly convert them to ZFS, chuckling evilly all the while. Not quite sure where that strawman came from.
ZFS fatzaps are expensive for small files. If it were true that 20% of the files in a Mac OS X installation required a fatzap (pdf link to ZFS-on-disk specification), that would indeed be unnecessarily wasteful.
A typical Mac OS X 10.4.x installation has on the order of 600,000 files.
I think that's about it. But of course there are a number of places where we disagree too.
ZFS would be awfully nice for a small segment of the Mac OS X user base if it were ready today.
If you spend any amount of time managing storage — if drives have gone bad on you, if you have ever run out of space on a desktop system and needed to add a drive (or two), if you have a RAID array — then you are the sort of user that could see some immediate benefit.
But of course as we already agreed, it's not ready today. You haven't been "cheated" and I'm sure you don't feel that way. But feel free to look forward to it: I sure am.
ZFS — or something with all the features of ZFS — will be more than nice, it will be necessary for tomorrow's Macintosh computers.
Both storage sizes and consumer consumption of storage grow exponentially. I tried to make this point last time, but MWJ seems to have misunderstood and accused me of misquoting. Let's try again.
In 1997, 20GB of storage meant a server RAID array. Ten years later, in 2007, 20GB of storage is considered "not enough" by most people. Across my entire household I have drives larger than that in my computer, in my TiVo, in my PlayStation 3, and even in my iPod. Now let's extrapolate that into the future.
In 2007, 20TB of storage means a server RAID array. Ten years from now, in 2017, 20TB of storage will similarly be considered "not enough". MWJ scoffed at ZFS because it's really pretty good at the problems of large storage. But you know what? A solution to managing that much data will need to be in place in Mac OS X well before 20TB drives become the norm. Better hope someone's working on it today.
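If you're skeptical of that extrapolation, the implied growth rate is actually quite modest. Here's the one-liner, at the shell with bc:

    # 20 GB -> 20 TB is a factor of 1000 over ten years; the implied annual
    # growth rate is 1000^(1/10), or roughly a doubling every year.
    echo 'e(l(1000)/10)' | bc -l    # -> ~1.995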
Meanwhile — and this is what scares the pants off me — the reliability numbers for hard drives have improved much more slowly than capacity.
Here's a fairly typical Seagate drive with a capacity of ~150GB = ~1.2 x 10^12 bits. The recoverable error rate is listed as 10 bits per 10^12 bits. Let's put those numbers together. That means that if you read the entire surface of the disk, you'll typically get twelve bits back that are wrong and which a retry could have fixed. (Updated Oct 11 2007: In the comments, Anton corrected me: I should've used the unrecoverable error rate here, not the recoverable error rate. The net result is that in ideal operating conditions bit errors occur over 100x less frequently than I originally suggested. However, it's still not zero. The net result is still a looming problem when you scale it across (installed base) x (storage consumption) x (time). See the comment thread.)
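If you want to check that arithmetic yourself, here's a quick back-of-the-envelope version at the shell using bc. (The unrecoverable rate of 1 per 10^14 bits is an assumption on my part: a typical spec-sheet figure for drives of this class, not something I've verified for this exact model.)

    # Assumptions: ~150GB drive, 10 recoverable errors per 10^12 bits read,
    # and (per the update above) roughly 1 unrecoverable error per 10^14 bits.
    bits=$(echo '150 * 10^9 * 8' | bc)        # ~1.2 x 10^12 bits on the drive
    echo "scale=3; $bits * 10 / 10^12" | bc   # recoverable errors per full read: ~12
    echo "scale=3; $bits / 10^14" | bc        # unrecoverable errors per full read: ~0.012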
Yes, really. Did you catch the implications of that? Silent single-bit errors are happening today. They happen much more often at high-end capacities and utilizations, and we often get lucky because some types of data (video, audio, etc) are resistant to that kind of single-bit error. But today's high end is tomorrow's medium end, and the day after tomorrow's low end. This problem is only going to get worse.
Worse, bit errors are cumulative. If you read and get a bit error, you might wind up writing it back out to disk too. Oops! Now that bit error just went from transient to permanent.
Still think end-to-end data integrity isn't worth it?
Apple using ZFS rather than writing their own is a smart choice.
As I hope I made abundantly clear in the last post, extending HFS+ to the future that we can see looming is just not an option — its structure is simply too far removed from these problems. It's really just not worth it. It's pretty awesome that the original HFS design scaled as far as it did: how many people can come up with a 20-year filesystem? But you have to know when to throw in the towel.
So if you accept that the things I described above are real, looming problems, then Apple really does need a filesystem with at least several of the more important attributes of ZFS.
The choices at this point are essentially twofold: (1) start completely from scratch, or (2) use ZFS. There's really no point in starting over. ZFS has a usable license and has been under development for at least five years by now. By the time you started over and burned five years on catching up it would be too late.
And I really do want to reiterate that the shared community of engineers from Apple, Sun, and FreeBSD working on ZFS is a real and measurable benefit. I've heard as much from friends in CoreOS. I can't understand the hostility to this very clear and obvious fact. It's as if Apple suddenly doubled or tripled the number of filesystem engineers it has available, snagging some really brilliant guys at the top of their profession in the process, and then multiplied its testing force by a factor of 10.
(To respond to a query voiced by MWJ, HFS+ never gathered that community when it was open-sourced because the design was already quite old at that point. It frankly didn't have anything new and exciting to offer, and it was saddled with performance problems and historical compromises of various kinds, so very few people were interested in it.)
ZFS fatzaps are unlikely to be a significant problem.
This gets a bit technical. Please skip this section if you don't care about this level of detail.
MWJ really pounded on this one. That was a bit weird to me, since it seemed to be suggesting that Apple would not expend any engineering effort on solving any obvious glaring problems with ZFS before releasing it. That's not the Apple I know.
But okay, let's suppose that we're stuck with ZFS and Mac OS X both frozen as they stand today. Let's try to make an a priori prediction of the actual cost of ZFS fatzaps on a typical Mac OS X system.
- Classic HFS attributes (FinderInfo, ExtendedFinderInfo, etc) are largely unnecessary and unused today because the Finder uses .DS_Store files instead. In the few cases where these attributes are set and used by legacy code, they should fit easily in a small number of microzaps.
- Extended attributes may create fatzaps. Today it seems like extended attributes are typically used on large files: disk images, digital photos, etc. This may provoke squawking from the peanut gallery, but once a file is above a certain size — roughly a couple of megabytes — using an extra 128KiB is negligible. If you have a 4MiB file and you add 128KiB to track its attributes, big deal: you've added 3%. It's not nothing, but it's hardly a significant problem.
- Another likely source of fatzaps in ZFS on Mac OS X is the resource fork. But with Classic gone, new Macs ship with virtually no resource forks on disk. There are none in the BSD subsystem. There are a handful in /System and /Library, mostly fonts. The biggest culprits are large old applications like Quicken and Microsoft Office. A quick measurement on my heavily-used one-year-old laptop shows that I have exactly 1877 resource forks out of 722210 files — that's 0.2%, not 20%. (Fun fact: the space that would be consumed by fatzap headers for these resource files comes out to just 235 MiB, or roughly six and a half Keyboard Software Updates. Again: not nothing, but hardly a crisis to scream about. The arithmetic is checked just below.)
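If you want to double-check those percentages and that 235 MiB figure, the arithmetic is trivial at the shell:

    # Fatzap overhead figures quoted above, recomputed with bc.
    echo "scale=1; 128 * 100 / 4096" | bc      # 128KiB added to a 4MiB file: ~3.1%
    echo "scale=1; 1877 * 100 / 722210" | bc   # resource forks on my laptop: ~0.2%
    echo "scale=1; 1877 * 128 / 1024" | bc     # fatzap headers for those forks: ~234.6 MiB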
Want to measure it yourself? Amit Singh's excellent hfsdebug utility will show you a quick summary. Just run "sudo hfsdebug -s" and look at the numbers for "files" and "non-zero resource forks". Or try "sudo hfsdebug -b attributes -l any | less" to examine the files which have extended attributes on your disk.
ZFS snapshots don't have to be wasteful
The cheesesteak analogy was cute. But rather than imagining that snapshots just eat and eat and eat storage until you choke in a greasy pile of death, it would help if we all understood how hard drive storage is actually used in practice, and how ZFS can work with that.
There are three major classes of stored data.
- Static data is data that you want to keep and almost never modify. This is your archive. Photographs, music, digital video, applications, email, etc. Archives are additive: unless you really run out of room, you rarely delete the old — you only add new stuff. You want the contents safe and immediately accessible, but they are essentially unchanging.
Snapshotting static data is close enough to free that you won't notice: the only cost is the basic cost of the snapshot. No extraneous data copies are ever created, because you never modify or delete this stuff anyway.
- Dynamic data is data that you want to keep, but are modifying with some frequency. This is whatever you are working on at the moment. It might be writing a novel, working in Photoshop, or writing code: in all cases you keep saving new versions over the old.
Snapshotting dynamic data is more expensive, because if you do it too much without recycling your old snapshots then you can build up a large backlog.
- Transient data is data that should not be persistent at all. These are your temporary files: local caches, scratch files, compiler object files, downloaded zip files or disk images, etc. These may be created, modified, or deleted at any moment.
Snapshotting transient data is generally a bad idea — by definition you don't care that much about it and you'd prefer it to be deleted immediately.
Got all that? Okay. Now I need to make a couple of points.
First, I assert that virtually all of the data on personal computer hard drives is static most of the time. Think about that. The operating system is static the whole time you are using it, until you install a system update. (And even then, usually just a few hundred megabytes change out of several gigabytes.) Your /Applications folder is static. Your music is static. And so on. Usually a few percent of your data is dynamic, and a few more percent is transient. But in most cases well over 95% is static. (Exceptions are easy to come up with: Sometimes you generate a large amount of transient data while building a disk image in iDVD or importing DV footage. That can shift the ratio below 95%. But once that task is complete you're back to the original ratio.)
Second, the biggest distinction that matters when snapshotting is separating persistent data from transient data. Taking snapshots of transient data is what will waste disk space in a hurry. Taking snapshots of dynamic data as a local backup is often valuable enough that it's okay to burn the small amount of disk space that it takes, because remember: that's the actual data that you're actively working on. And as we already mentioned, snapshots of static data are free.
Now here's where it gets interesting.
With ZFS, snapshots work on the filesystem level. Because it no longer uses the "big floppy" model of storage, new filesystems are very cheap to create. (They are almost as lightweight as directories, and often used to replace them.) So let's create one or more special filesystems just for transient data and exclude them from our regular snapshot process. In fact on Mac OS X that's easy: we have well-defined directories for transient data: ~/Library/Caches, /tmp, and so on. Link those all off to one or more transient filesystems and they will never wind up in a snapshot of the important stuff. I wouldn't expect users to do this for themselves, of course — but it could certainly be set up that way automatically by Apple.
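In ZFS terms, the setup might look something like this. (This is just a sketch: the pool name, dataset names, and mountpoints are all made up for illustration, and a real implementation from Apple would presumably hide every bit of it from the user.)

    # One dataset for the stuff worth keeping, separate datasets for transient
    # locations, mounted where Mac OS X expects them. Names are illustrative.
    zfs create -o mountpoint=/Users tank/users
    zfs create tank/transient
    zfs create -o mountpoint=/private/tmp tank/transient/tmp
    zfs create -o mountpoint=/Users/alice/Library/Caches tank/transient/caches

    # Snapshots are taken per filesystem, so snapshotting the persistent data
    # never touches the transient datasets at all.
    zfs snapshot tank/users@2007-10-11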
Once the transient data is out of the picture, our snapshots will consist of 95% or more static data — which is not copied in any way — and a tiny percentage of dynamic data. And remember, the dynamic data is not even copied unless and until it changes. The net effect is very similar to doing an incremental backup of exactly and only the files you are working on. This is essentially a perfect local backup: no duplication except where it's actually needed.
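You can see this for yourself on any ZFS system today. (The dataset name below is a placeholder.)

    # A brand-new snapshot consumes essentially no space; its USED column only
    # grows as the live filesystem diverges from it, because unchanged blocks
    # are shared between the snapshot and the live data rather than copied.
    zfs snapshot tank/users@demo
    zfs list -t snapshot -o name,used,referenced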
Will you want to allow snapshots to live forever? Of course not. One reasonable model for taking backup snapshots might be to remember 12 hourly snapshots, 7 daily snapshots, and 4 weekly snapshots. If you are getting tight on storage the system could take new snapshots less frequently and expire them more aggressively. Remember: when nothing is changing the snapshots don't take up any space.
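Here's roughly what that retention policy could look like as an hourly job, expressed with the stock ZFS command-line tools. (This is a sketch of the idea, not anything Apple has announced; the dataset name and snapshot naming scheme are mine.)

    #!/bin/sh
    # Take an hourly snapshot of the persistent data, then keep only the 12
    # most recent hourly snapshots. Daily and weekly jobs would look the same
    # with a different prefix and a different "keep" count.
    DATASET=tank/users
    KEEP=12

    zfs snapshot "$DATASET@hourly-$(date +%Y%m%d-%H%M)"

    # List hourly snapshots oldest-first and destroy all but the newest $KEEP.
    zfs list -H -t snapshot -o name -s creation |
      grep "^$DATASET@hourly-" |
      awk -v keep="$KEEP" '{ lines[NR] = $0 } END { for (i = 1; i <= NR - keep; i++) print lines[i] }' |
      while read -r snap; do
        zfs destroy "$snap"
      done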
Wrap-up: Listen to the smart guys
Some very smart people at Sun started the ball rolling by putting an awful lot of thought into the future of storage, and they came up with ZFS.
After they announced it and started talking about it, other brilliant people at Apple (and FreeBSD, and NetBSD) paid attention to what they were doing. And they listened, and thought about it, and looked at the code, and wound up coming around to the side of ZFS as well.
If you think I'm smart, just know that I'm in awe of some of the guys who've been involved with this project.
If you think I'm stupid, why, I look forward to hearing from you in the comments.