Saturday, October 06, 2007

Don't be a ZFS Hater  

John Gruber recently linked to — and thus gave credibility to — a MWJ post ripping on a fairly reasonable AppleInsider post about ZFS. Representative quote:

“We don't find HFS Plus administration to be complex, and we can't tell you what those other things mean, but they sound really cool, and therefore we want them. On the magic unlocked iPhone. For free.”

Har har har. Wait. Hold on a minute. Why is it suddenly fashionable to bash on ZFS?

Part of it is a backlash against the weird and obviously fake rumor about ZFS becoming the default in Leopard, I guess. (No thanks to Sun's CEO Jonathan Schwartz here, who as far as I know has never publicly explained why he either misspoke or misunderstood what was going on back in June.)

But don't do that. Don't be a ZFS hater.

A word about my background

Let's get the credentials out of the way up front. Today I work on a file I/O subsystem for PlayStation 3 games. Before that, I worked in Apple's CoreOS filesystems group. Before that, I worked on DiscRecording.framework, and singlehandedly created the content subframework that streamed out HFS+, ISO-9660, and Joliet filesystems. Before that, I worked on the same thing for Mac OS 9. And before that, I worked on mass storage drivers for external USB/FireWire drives and internal ATA/ATAPI/SCSI drives.

You might say I know a thing or two about filesystems and storage.

What bugged me about the article

  1. ZFS is a fine candidate to replace HFS+ eventually. It's not going to happen overnight, no. And it'll be available as an option for early adopters way before it becomes the default. But several years from now? Absolutely.

  2. The bizarre rants about ZFS wasting processor time and disk space. I'm sorry, I wasn't aware that we were still using 30MHz machines with 1.44MB floppies. ZFS is great specifically because it takes two things that modern computers tend to have a surplus of — CPU time and hard disk space — and borrows a bit of it in the name of data integrity and ease of use. This tradeoff made very little sense in, say, 1992. But here in 2007 it's brilliant.

  3. Sneeringly implying that HFS+ is sufficient. Sure, HFS+ administration is simple, but it's also inflexible. It locks you into what I call the "big floppy" model of storage. This only gets more and more painful as disks get bigger and bigger. Storage management has come a long way since the original HFS was created, and ZFS administration lets you do things that HFS+ can only dream of.

  4. Claiming that RAID-Z is required for checksums to be useful. This is flat-out wrong. Sure, RAID-Z helps a lot by storing an error-correcting code. But even without RAID-Z, simply recognizing that the data is bad gets you well down the road to recovering from an error — depending on the exact nature of the problem, a simple retry loop can in fact get you the right data the second or third time. And as soon as you know there is a problem you can mark the block as bad and aggressively copy it elsewhere to preserve it. I suppose the author would prefer that the filesystem silently returned bad data?

  5. Completely ignoring Moore's Law. How dumb do you need to be to willfully ignore the fact that the things that are bleeding-edge today will be commonplace tomorrow? Twenty gigabytes was a massive server array ten years ago. Today I use a hard drive ten times bigger than that just to watch TV.

Reading this article made me feel like I was back in 1996 listening to people debate cooperative vs preemptive multitasking. In the Mac community at that time there were, I'm ashamed to say, a lot of heated discussions about how preemptive threading was unnecessary. There were some people (like me) who were clamoring for a preemptive scheduler, while others defended the status quo — claiming, among other things, that Mac OS 8's cooperative threads "weren't that bad" and were "fine if you used them correctly". Um, yeah.

Since then we've thoroughly settled that debate, of course. And if you know anything about technology you might be able to understand why there's a difference between "not that bad" and "a completely new paradigm".

ZFS is cool

Let's do a short rundown of reasons why I, a qualified filesystem and storage engineer, think that ZFS is cool. I'll leave out some of the more technical reasons and just try to keep it in plain English, with links for further reading and a short command-line sketch after the list.

  • Logical Volume Management. Hard disks are no longer big floppies. They are building blocks that you can just drop in to add storage to your system. Partitioning, formatting, migrating data from old small drive to new big drive -- these all go away.

  • Adaptive Replacement Caching. ZFS uses a smarter cache eviction algorithm than OSX's UBC, which lets it deal well with data that is streamed and only read once. (Sound familiar? It could eliminate the need for F_NOCACHE.)

  • Snapshots. Think about how drastically the trash can metaphor changed the way people worked with files. Snapshots are the same concept, extended system-wide. They can eliminate entire classes of problems.

    I don't know about you, but in the past year I have done all of the following, sometimes more than once. Snapshots would've made these a non-issue:

    • installed a software update and then found out it broke something
    • held off on installing a software update because I was afraid something might break
    • lost some work in between backups
    • accidentally deleted an entire directory with a mistyped rm -rf or SCM delete command.

  • Copy-on-write in the filesystem makes snapshots super-cheap to implement. No, not "free", just "so cheap you wouldn't possibly notice". If you are grousing about wasted disk space, you don't understand how it works. Mac OS X uses copy-on-write extensively in its virtual memory system because it's both cheap and incredibly effective at reducing wasted memory. The same thing applies to the filesystem.

  • End-to-end data integrity. Journaling is the only thing HFS+ does to prevent data loss. This is hugely important, and big props are due to Dominic Giampaolo for hacking it in. But journaling only protects the write stage. Once the bits are on the disk, HFS+ simply assumes they're correct.

    But as disks get larger and cheaper, we're finding that this isn't sufficient any more. The odds of any one bit being wrong are very small. And yet the 200GB hard disk in my laptop has a capacity of about 1.6 trillion bits. The cumulative probability that EVERY SINGLE ONE of those bits is correct is effectively zero. Zero!

    Backups are one answer to this problem, but as your data set gets larger they get more and more expensive and slow. (How do you back up a terabyte's worth of data? How long does it take? Worse still, how do you really know that your backup actually worked instead of just appearing to work?) So far, disk capacity has consistently grown faster than disk speed, meaning that backups will only continue to get slower and slower. Boy, wouldn't it be great if the filesystem — which is the natural bottleneck for everything disk-related — helped you out a little more on this? ZFS does.

  • Combinations of the above. There are some pretty cool results that fall out of having all of these things together in one place. Even if HFS+ supported snapshots, you'd still be limited by the "big floppy" storage model. It really starts to get interesting when you combine snapshots with smart use of logical volume management. And we've already discussed how RAID-Z enhances ZFS's basic built-in end-to-end data integrity by adding stronger error correction. There are other cool combinations too. It all adds up to a whole which is greater than the sum of its parts.
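
Here's the promised command-line sketch. Treat it as an illustration only: the disk names are hypothetical, and the syntax is what current Solaris builds accept. The point is how little ceremony the features above require.

    # Create a mirrored pool and a filesystem in it. No partitioning, no
    # newfs, no fstab editing; tank/home is mounted and usable immediately.
    zpool create tank mirror c1t0d0 c2t0d0
    zfs create tank/home

    # Take a snapshot before a risky software update...
    zfs snapshot tank/home@before-update

    # ...and roll back if the update breaks something.
    zfs rollback tank/home@before-update

    # Scrub the pool: read every block, verify every checksum, and repair
    # anything repairable from the mirror, all while the system stays live.
    zpool scrub tank
    zpool status tank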

Is any of this stuff new and unique to ZFS? Not really. Bits and pieces of everything I've mentioned above have shown up in many places.

What ZFS brings to the table is that it's the total package — everything all wrapped up in one place, already integrated, and in fact already shipping and working. If you happened to be looking for a next-generation filesystem, and Apple is, you wouldn't need to look much further than ZFS.

Still not convinced?

Okay, here are three further high-level benefits of ZFS over HFS+:

  • Designed to support Unix. There are a lot of subtleties to supporting a modern Unix system. HFS+ was not really designed for that purpose. Yes, it's been hacked up to support Unix permissions, node locking, lazy zero-fill, symbolic links, hard links, NFS readdir semantics, and more. Some of these were easy. Others were painful and exhibit subtle bugs or performance problems to this day.

  • Designed to support modern filesystem concepts. Transactions. Write cache safety. Sparse files. Extended metadata attributes. I/O sorting and priority. Multiple prefetch streams. Compression. Encryption. And that's just off the top of my head.

    HFS+ is 10-year-old code built on a 20-year-old design. It's been extended to do some of this, and could in theory be extended to do some of the others... but not all of them. You'll just have to trust me that it's getting to the point where some of this stuff is just not worth the engineering cost of hacking it in. HFS+ is great, but it's getting old and creaky.

  • Actually used by someone besides Apple. Don't underestimate the value of a shared standard. If both Sun and Apple start using the same open-source filesystem, it creates a lot of momentum behind it. Having more OS clients means that you get a lot more eyes on the code, which improves the code via bugfixes and performance enhancements. This makes the code better, which makes ZFS more attractive to new clients, which means more eyes on the code, which means the code gets better.... it's a virtuous circle.

Is ZFS the perfect filesystem? I doubt it. I'm sure it's got its limitations just like any other filesystem. In particular, wrapping a GUI around its administration options and coming up with good default parameters will be an interesting trick, and I look forward to seeing how Apple does it.

But really, seriously, dude. The kid is cool. Don't be like that.

Don't be a ZFS hater.

Updates:

MWJ's response: You don't have to be a ZFS hater to know it's wrong for you.
My followup: ZFS Hater Redux

63 comments:

  • Eschaton said...

    Storage management has come a long way since the original HFS was created

    Hell, storage management came a long way before the original HFS was created. It's not as backward as, say, the original Unix filesystem (the inode_t+char[14] one) or MFS, but still.

  • Devin Coughlin said...

    ZFS sounds awesome. The one thing I worry about is whether ZFS on the Mac will support case-preserving but insensitive filenames. This seems to me to be a rather important part of the user interface, and I wonder if switching to ZFS without it would be a good idea for non-developer types.

  • Drew Thaler said...

    I'm pretty sure ZFS doesn't do case-insensitive filenames. But that's a good thing.

    After having worked with its implementation over and over again in filesystem after filesystem, with the most recent example being about three months ago, I personally hate case-insensitivity. :-) The filesystem is the wrong place for it. It slows down performance, drags huge text-encoding tables into the kernel, creates heinous and subtle encoding problems, and reinforces non-portable bad habits. Using a real 1:1 mapping between keys and data is much better.

    If you want case-insensitivity in the UI, I agree. Let's enforce it in the Save dialog and when renaming files in the Finder, which are just about the only two places you actually need it. But pushing it down into the bottommost layers of the storage stack is just wrong.

  • Anonymous said...

    I had no idea anyone was angry at ZFS. This is why I don't read nerd blogs ;)

    Considering all the problems people are having with GUID partition tables, I am kinda glad it's *not* being rolled out right now, though. I don't trust Sergio or whoever else is still kicking around to get it right the first time.

  • Anonymous said...

    How about the claim by MacJournals that "ZFS can be a lot slower because it imposes tons of overhead on the kind of tiny files that Mac OS X uses by the thousands."

    Is there any truth to that?

  • Emanuele "∞" Vulcano said...

    Time Machine could be retooled to be a ZFS snapshot access UI in the future...

  • Orion Edwards said...

    The filesystem is the wrong place for it. It slows down performance, drags huge text-encoding tables into the kernel, creates heinous and subtle encoding problems, and reinforces non-portable bad habits. Using a real 1:1 mapping between keys and data is much better.

    I agree with you in theory, but if you push it out to the application level you just know some third party app is going to get it wrong and start causing you even more headaches.

    "Trust the user/application to do the right thing" has been proven over and over again to be a terrible idea :-(

  • Anonymous said...

    I think you may have slightly mis-interpreted the MacJournals article. It was less about ZFS and more about AI reporting and the way that people assume that the "new thing" must surely be entirely more awesome than "the old thing." That's how I took it anyway.

    If you want case-insensitivity in the UI, I agree. Let's enforce it in the Save dialog and when renaming files in the Finder, which are just about the only two places you actually need it.

    Wouldn't that only be true for applications that use the Apple provided Save dialog? And your personal preferences aside, you can't possibly argue that your great aunt Muriel should instinctively understand that BananaPuddingRecipe.doc and bananapuddingrecipe.doc are two different files? That's just absurd.

  • Drew Thaler said...

    The short answer is "no". There's no truth to it.

    Here's the longer answer. If you are looking at ZFS without understanding how filesystems work, it might seem like there's a lot of overhead. But in fact any filesystem has a lot of overhead for tiny files. For example, HFS+ maintains a file record and a file thread record in the catalog, some allocation bits in the volume bitmap, and an extended attribute record in a separate attribute file. And all writes to all of these different pieces are funneled through a separate journal file. (And then there's the cost of the path lookup, which is more expensive in HFS+ than just about any other filesystem.)

    But is that really "significant"? No. What we're talking about is a lot like counting cycles ... traversing data structures costs a fraction of a microsecond here or there. The cost of an actual I/O is normally measured in units a thousand times larger than that. ZFS focuses its attention on the expensive part -- the I/O -- and does things like transactional write-gathering to improve it. If you can solve a large inefficiency at the cost of several small inefficiencies, you're still in the black.
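
    If you want to see the tiny-file overhead for yourself, a crude shell micro-benchmark is enough. This is only a sketch, not rigorous methodology; the numbers vary wildly with filesystem, disk, and cache state.

        # Time 10,000 empty-file creates in a scratch directory. The data I/O
        # is nil, so whatever you measure is pure metadata overhead.
        mkdir tinybench && cd tinybench
        time sh -c 'i=0; while [ $i -lt 10000 ]; do : > "f$i"; i=$((i+1)); done'
        cd .. && rm -rf tinybench

    The per-file cost is real, but it's exactly the kind of cost that a single seek dwarfs.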

  • Unknown said...

    Thank you Drew,

    I hope this article gets more exposure, because currently, too many articles got too much exposure from people who have no clue about ZFS in particular, or file systems in general.

    Alex, who -- in his younger days -- would know HFS, ISO 9660 and ISO 13346 by heart

  • Drew Thaler said...

    Wouldn't that only be true for applications that use the Apple provided Save dialog?

    Virtually every application goes through the standard Save dialog. It's not 1997: people just don't write their own custom dialogs any more. And since OSX enforces Carbon, there is no old skanky pre-NavServices code to worry about.

    your great aunt Muriel...

    I think you mean Aunt Tillie.

    How would she get two such files? She won't create them herself -- the Save dialog won't let her. She won't rename something accidentally -- the Finder won't let her either. Even if she does get them somehow, when she opens the folder, she'll be in the default column view which will nicely sort the two files right next to each other.

    Honestly, I think it's just a bugaboo. It's largely a non-issue these days. Someone just needs to have the balls to actually say so. :-)

  • Nicholas Riley said...

    The Finder does already enforce more strict naming conventions than the underlying filesystems, so this wouldn't be anything new. But unfortunately, we're not out of the woods yet—I noticed last week that the Adobe CS3 apps have their own open/save dialogs, though the Apple ones are an option.

    In any case, Sun already implemented case-insensitive/preserving behavior for ZFS to facilitate serving SMB/CIFS from ZFS. That it works better for the Mac is a happy coincidence.

  • Anonymous said...

    > A word about my background

    ... and you wrote CLImax, the coolest utility I ever saw for Mac OS.

  • Anonymous said...

    The argument that case insensitivity helps clueless people is bogus.

    Yes, Aunt Tillie will think that "Muffin Recipe.rtf" and "muffin recipe.rtf" ought to be the same file. But you know what? She'll also think that "Muffin Recipe .rtf" and "Recipe for Muffins.rtf" and "Mufin Recipe.txt" ought to be the same file too.

    Users already don't generally understand how the OS decides whether two files are the same or not. Trying to alleviate this problem by mapping names with different case to the same file solves only 1% of the problem and just isn't worth the effort. Likewise, throwing that away only causes 1% more problems and is totally worth it if it brings just about any benefit at all.

  • Anonymous said...

    The "sacrifice a small amount of processor resource for greater data integrity" seems especially relevant today.

    I went for the first 10 or so years of my computing life without a single hard drive failure. Those things just kept going and going.

    Today, we can buy huge disks at insanely low prices, but the failure rates seem to be astronomical. The low cost and capacity of modern hard drives has come at the expense of reliability.

    I wouldn't be surprised to see a 100% failure rate over 4 years in the near future.

  • Anonymous said...

    Unicode is hard™. Switching to a lazy method of using a bucket of bits does not protect you from unicode.

  • Anonymous said...

    I am a big ZFS fan but I hate to think this is the reason ZFS won't be in OS X as soon as we would like given Legal holdups: http://blogs.netapp.com/dave/2007/09/netapp-sues-sun.html

    Thoughts?

  • lar3ry said...

    Quote: Yes, Aunt Tillie will think that "Muffin Recipe.rtf" and "muffin recipe.rtf" ought to be the same file. But you know what? She'll also think that "Muffin Recipe .rtf" and "Recipe for Muffins.rtf" and "Mufin Recipe.txt" ought to be the same file too.

    ... which is the reason people like to use things like "Google" and "Spotlight" to search for things...

  • Anonymous said...

    Since when is case-insensitivity claimed to ONLY help clueless people? I am not Aunt Millie, but it is entirely possible for me to simply make a mistake, not realize I've already saved a file called "Last quarter" in a particular folder, make a new one named "last quarter," and only later find out about the annoying duplication I've done.

    Claiming that a safeguard against mistakes is a bad idea just because some people will make other mistakes anyway is silly.

    As far as making it a higher-level function, it would be nice if all applications did things the 'correct' way, but even on the Mac this is not the case. Any option which depends on every single software developer doing it a certain way is by nature less fault-tolerant than an option enforced by the system.

  • Anonymous said...

    Is there an easy way to play with ZFS right now? Is there a unix / linux distro that has it out of the box?

  • Anonymous said...

    @devin

    Case insensitive filenames are already gone in Leopard. Simply because POSIX and SUSv3 don't allow it.

  • Anonymous said...

    OK, what I actually didn't like about the MDJ rant (and I haven't seen addressed here) was that they assumed that ZFS would be less efficient simply because it was more sophisticated. Knowing the internals of HFS+ quite well, and having studied ZFS a bit, it's not clear-cut to me on paper that ZFS should take substantially more system resources. There are a _lot_ of inefficient details in HFS+ that ZFS either avoids or handles a helluva lot more efficiently.

  • Fred Blasdel said...

    Anonymous is wrong, case-insensitivity is alive and well in Leopard, I just checked in the latest build. It behaves just like previous versions, both from the Finder/Carbon/Cocoa and from POSIX.

  • Travis Butler said...

    Sorry, you just punched a couple of my buttons there.

    First, count me on the case-insensitive side. Yes, it is a huge, KILLER problem on the user side - heck, can you imagine trying to do phone support? I've already had enough trouble trying to cope with case-sensitivity in passwords, thank you very much. And implementing it at any level above the filesystem is not just asking but begging for trouble. The colon/slash hack for filenames is still causing trouble six years later!

    Then there was the crack about 30MHz processors and floppies. That always sets me off, because too many times out of ten it's an excuse for lazy programming, and it still costs. One case of bloat might not make much difference, but when it gets used as a common excuse, things start adding up... Sure, disk space has been increasing exponentially. But so have storage-intensive applications for that data, and it's only going to get worse. I've got more than ten gigs of pics alone, 40-50 gigs of music and gobs of raw and processed video.

    Also, remember the laptop? You know, the fastest-growing segment of the computer industry? The one where we just can't plug in another drive whenever we need more space? The one where the form factor has an upper storage limit per mechanism that's 4-5x smaller than desktop mechanisms, at several times the cost per gig? You may be able to afford to plop in a 200 gig drive; I could only afford a 120, and I'm constantly working to keep it clean enough to use.

    The moment someone starts claiming that Moore's Law means resource use doesn't matter, they dig themselves a deep hole for their argument.

    ZFS may indeed have lots of advantages, and I want to see a replacement for HFS, but don't start claiming we don't need to count the costs.

  • Drew Thaler said...

    I fully agree with Jeff Bonwick when he says, "Performance is a goal, correctness is a constraint."

    Travis, never once did I claim that we don't need to count costs. The point is that you need to intelligently make tradeoffs. We do this all the time in computer architecture.

    * A preemptive thread scheduler takes much more overhead than a cooperative thread scheduler.
    * TCP has significantly more overhead than UDP.
    * Dynamic linking is much slower than static linking.
    * Objective-C messaging is slower than a function call.
    * XML, even binary XML, is much more verbose and wasteful than using a proprietary format.
    * Quartz double-buffering and compositing chews way more memory and resources than QuickDraw used to.

    And yet it seems that we have basically settled on the "slower" choice for all of these things. Mysterious!

    I don't have any hard info on whether ZFS is going to be slower than HFS+, but it would really surprise me if it was. As I said above, I/O throughput is THE bottleneck. If you speed that up with smarter access patterns then you wind up with something that is actually faster. I haven't had a chance to play with it on Mac OS X yet, but that's been the result of benchmarking that I've seen on other platforms — even with its end-to-end checksumming, it's as fast or faster than other filesystems.

  • Anonymous said...

    Found your comments on integrity interesting. In theory, the drive firmware should do error correction. In practice, my experience is that this is not reliable (especially with overheated drives in machines designed with form over function *cough*iMac*cough*). This is why I CRC my app's (large, long-lived, changed often) data files. I shouldn't have to do that but experience has proven it to be necessary. Error correction (or at least detection) in the filesystem gets my vote.

  • Drew Thaler said...

    By the way, quick clarification: when I invoked Moore's law, I didn't mean "oh, it's okay to be wasteful and slow because computers will catch up".

    What I meant was that part of the job of an operating system engineer is to be a futurist.

    When you are writing code that will last a decade or more, you need to think about tomorrow's problems and start solving them right now. In all likelihood, in much less than ten years entry-level computers like the iMac will ship with more than the 20 terabytes of storage that the MWJ article joked about. That may boggle your mind, but it's true.

    If you're an OS engineer and you aren't hard at work solving the problems intrinsic to that TODAY, you're not doing your job. HFS+ is "not that bad" for today's needs. It's going to be woefully inadequate real soon now.

  • Unknown said...

    Beyond just case sensitivity/insensitivity, how does ZFS deal with differently-normalized unicode? It's terrible for users if there are two files called “fóo.jpg”, with the “ó” normalized differently in the two. That is, two files with literally the same characters in their names, counted as separate because they use different code points. HFS+ normalizes everything to the decomposed form, while other filesystems happily save the two names as different files.

  • Drew Thaler said...

    Sam: If you want to play with ZFS, the Solaris Express Developer Edition is your best bet. You could also try a recent FreeBSD snapshot ... it's been in top-of-tree FreeBSD since April, so I believe 7.0-CURRENT should have it.

    Linux has a licensing problem with ZFS: Sun's CDDL is not compatible with Linux's GPLv2. There is a userspace ZFS-via-FUSE implementation, but last I heard there were some performance problems.
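
    If you just want to poke at the administration model without dedicating real disks, ZFS will also happily build a pool out of ordinary files. A throwaway sketch for a Solaris Express box (the paths and pool name are arbitrary):

        # Create four 128MB backing files and pool them into a raidz group.
        mkfile 128m /var/tmp/zd0 /var/tmp/zd1 /var/tmp/zd2 /var/tmp/zd3
        zpool create sandbox raidz /var/tmp/zd0 /var/tmp/zd1 /var/tmp/zd2 /var/tmp/zd3

        # Experiment away, then throw the whole thing out.
        zpool status sandbox
        zpool destroy sandbox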

  • Chris Adams said...

    I know I shouldn't be but I'm still shocked that people are arguing against data integrity - if I didn't care about my data, I could just read /dev/random and avoid the inconvenience of having to come up with more than one filename, too.

    There's a really interesting report from CERN about the observed real-world failure rates on their systems - about one out of every 1,500 files had an otherwise undetected single-bit error. Sure, CERN's environment is more complicated than most, but they also have higher-quality equipment and people who are a lot more likely to notice a failure than J. Random User, who has a single internal disk, a couple of FireWire drives, and last ran a backup during the Clinton administration.

    I think ZFS' biggest problem is that people associate it with scary enterprise Unix admins, which is really amusing given the history of it. Solaris already had horrible, unfriendly volume managers and filesystems which required gurus to avoid data loss - ZFS was an attempt to get away from all of that. It's a bit disappointing that the process of connecting storage in 2007 is more complicated than answering a simple prompt: "You've added a new disk. Do you want to replace your current drive, protect your data from a drive failure or expand your storage capacity?"

  • Anonymous said...

    "How would she get two such files? She won't create them herself -- the Save dialog won't let her."

    She makes a file called "muffin recipe.rtf" in one folder, then a file called "Muffin Recipe.rtf" in another folder, then drags one file into the other folder.

    Things like this have happened to me on Linux, albeit the names were more like "Widget.rb" and "widget.rb". Took me a while to figure out what was going wrong (I didn't notice I'd ended up with two files where I expected only one.)

  • Anonymous said...

    As far as I know, MacOS X can be both case-sensitive and case-insensitive, depending on the format of the disk (HFS+ and UFS respectively I think). But some applications assume case-insensitive, and will fail when installed on a case-sensitive volume.
    Oh, and I vote for case-insensitive/case-preserving, based on my 'what would the wife think was correct' test. Case-sensitive is so obviously a product of the programmer mindset, where 'a' and 'A' are different. Of course, that doesn't mean that the underlying file system can't support it - just so long as the user sees case-insensitive/case-preserved.

  • Anonymous said...

    Question for you, Drew. Are you familiar with Deep Freeze? I've been using it on my Mac for a little while and it does a lot of what you say Snapshots on ZFS do. Just restart and every change you made is gone. I have no idea how it works since it doesn't make a separate image of your system, but they're definitely doing something at the kernel level. Anyway, thanks for the great post.

  • LKM said...

    I think you missed MacJournal's in-depth article about ZFS and only read the rant-ish blog post. MacJournals makes a lot of good points. It's very obvious that ZFS can't possibly become the default file system for OS X any time soon - maybe ever.

    MacJournals did point out that ZFS is an awesome file system. It's just not suitable as the main, bootable Mac OS X file system. So it's hardly "ZFS hating."

    Matt Deatherage has some credibility in this area. If I remember correctly (and I can't find any sources), he actually used to work on file system code at Apple, right?

  • LKM said...

    >Time Machine could be retooled to be
    >a ZFS snapshot access UI in the
    >future...

    No. Snapshots and Backup are not the same thing. Snapshots as used by ZFS are stored on the same volume as your data. In fact, a Snapshot is nothing else but a kind of flag telling the file system to not really delete any of the data.

    So Snapshots take up a lot of storage space, quickly. And if your drive goes, so do your Snapshots.

    Backups, on the other hand, are on an external storage system. Time Machine is backup, not snapshot.
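
    To make the distinction concrete in commands (the pool, filesystem, and host names here are made up): a snapshot stays inside the pool, while a backup is that same data serialized onto separate storage. ZFS's send/receive happens to make the second easy as well.

        # Snapshot: instant and nearly free, but it lives and dies with the pool.
        zfs snapshot tank/home@monday

        # Backup: serialize the snapshot and reconstitute it on another
        # machine's pool. Separate storage, so it survives a dead drive.
        zfs send tank/home@monday | ssh backuphost zfs recv backup/home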

  • Unknown said...

    @travis:"Then there was the crack about 30mhz processors and floppies. That always sets me off, because too many times out of ten it's an excuse for lazy programming, and it still costs."

    Drew continued with a bunch of simple, objective items, but travis' comment immediately reminded me of a subjective complaint from the old low-level-engineering types who'd get all pissy that the Mac (and later, other platforms) was "wasting" too many system resources "drawing pretty pictures on screens", when the fact of the matter was, IT'S EXACTLY HOW PEOPLE ACCESS a client computer.

  • Simos said...

    I think case insensitivity is a little ridiculous. It is very confusing for me. fairyTale.txt IS not FairyTALE.txt. Why should it be?

  • Alden said...

    We speak and write in a case-sensitive language, texters and slackers aside.

    I have set up Macs with HFS+ Case Sensitive before, and found that most things worked great - and some items (like MarkSpace Missing Sync) had some errors that traced directly to uncaught coding errors that depended on case insensitivity. Moving to CS is like moving to GCC 4 - some of your misses will be caught. Deal with it.

    This may be moot, anyway - ZFS has incorporated a CI mode at Apple's request - the bug report/feature request circulated earlier this year.

    That CERN report is only part of the story. In large storage environments, the 'floppy' model of HDDs is painfully stupid. Imagine running your desktop off an array of floppy drives, with the enforced segmentation. The Sun guys compared the ZFS model to RAM - you just drop in another DIMM, and you never have to worry about which DIMM you're using.

    Checksumming may not be compelling on a laptop, as a boot drive - perhaps we'll stick with HFS+ CS. However, to talk to any large disk system, whether as a laptop plugging in, or a desktop/server dedicated to this storage, ZFS is pretty darned important. If you have O(100) disks, you simply let the system take care of the failures for you. Perhaps when 10 have failed, you take down the system (or hotswap), do a quick replace of the drives with any old size you have on hand (probably larger), and you're done. Tell the system to go look for them, and go do something else.

    As dataset sizes grow and disks stay slow, as error rates increase... we need this.

    The new thing (in this case) really is awesomer than the old thing.

  • Anonymous said...

    I haven't picked up on much - or any - negativity regarding ZFS in the Mac community. In fact, the opposite is true - lots of people (me included) were getting very excited that this could be available in 10.5.

    It seems pretty clear to me that ZFS is the way forward - it's only a matter of time. Just the ability to drop in a new disk and have it add capacity to the storage pool, instead of at a specific mount point, will be of massive benefit to almost all users.

    I only hope that the apparent self-destruction of Sun doesn't mean that great developments such as ZFS will no longer come about.

  • Anonymous said...

    ...as for the case sensitivity issue: Apple has this right - preserve case, but ignore it. The simple answer as to why this is correct is that otherwise you could have (by my calculations) 512 DIFFERENT files all called 'readme.txt'

    Can you imagine saving over a file, but miscapitalising a character? Now you have a new file. Can you now imagine telling someone to open 'readme.txt' - which one?!

    ...and that's just for a file with a 6 character name and a TLA. A more realistic file name such as 'My geography homework assignment 08/10/07.doc' has something like 4,294,967,296 different capitalisations.....

  • Anonymous said...

    We speak and write in a case-sensitive language, texters and slackers aside.

    Uhm, no we don't.

    Spoken language predates written language. And even then, the older written languages have no concept of case. It's not possible to convey case via spoken language. Hell, a lot of people can't even spell correctly.

  • Drew Thaler said...

    Jens: Okay, three places.
    Zach: Haven't used Deep Freeze, but my first guess is that they create a virtual IOBlockStorageDriver and redirect writes elsewhere so that they can be flushed. It sounds like a snapshot, but not as flexible or automatic as ZFS.
    LKM: You're right. However, the subheadline 'it's fantastic, but not for macs as a "default" anything' doesn't inspire confidence in me.
    Emanuele and LKM: Time Machine uses a snapshot metaphor for its backups. Snapshots fit the same metaphor nicely.

    Regarding case-insensitivity: Mike Ash summed it up best, I think. Taking it out of the bottommost layers of the storage stack doesn't change anything for users or developers, because apps need to work correctly in case-sensitive environments like HFSX anyway. I promise I will write up a separate blog post with more details since this comment thread is getting out of control.

    Regarding laptops: The future doesn't need to be like the past. Just because notebooks have been single-drive machines doesn't mean they need to stay that way. What if the next laptop had an array of four or more 64GB flash sticks? With HFS+ they'd need to be wrapped up and presented as a single disk, meaning that if one dies the whole thing dies. With ZFS they could be managed as a pool — a lot more like RAM is. Your laptop would be thinner, more shock-resistant, and way easier to upgrade.

  • Unknown said...

    You mention that ZFS solves the huge storage space issues, say like backing up a terabyte of data. I'm wondering if you could provide more detail. Mostly, I'm interested because I tend to run into this problem all the time. My laptop has a 100GB hard drive, and I have two 500GB FireWire/USB2 drives, four 250GB FireWire drives, and then some random 100/150GB drives.

    For a while now, I've been trying to transfer data from my 250GB drive that contained music to one of my 500GB drives, and the process has been horrible. I can't let it run overnight, because it stops as soon as it hits an error. And I don't have time to run it during the day, due to client calls etc.

    So now my music collection is split across two drives, with repeats, etc.; and the same thing happens with my back-up files for clients—but it's not as important to me that they remain unique.

    For myself, I just see this getting worse and worse. I now purchase most of my TV shows through iTunes, at a savings over my former $90/mo cable bill, but at a cost in storage space. So between music, TV, large database and graphic files, plus my Users folder, I'm already over 1TB of storage and growing, and I find it a pain just to transfer files, not to mention back up.

  • Anonymous said...

    Drew,

    ZFS uses a smarter cache eviction algorithm than OSX's UBC, which lets it deal well with data that is streamed and only read once.

    This algorithm could (and probably should) be integrated into the OSX UBC, where it will benefit all filesystems, not just ZFS.

    That is actually one of the few problems I have with ZFS: it sets off my layering violation radar. It's so much more than just a filesystem. I don't know how much of that integration is necessary to deliver the ambitious feature set, but I'd feel more comfortable if some effort was put into splitting out the buffer cache, LVM, etc. from the actual FS code.

    When you are writing code that will last a decade or more, you need to think about tomorrow's problems and start solving them right now.

    I couldn't agree more with this, and on that basis I'm happy to see Apple working on incorporating ZFS.

    I think some of the backlash against the ZFS stories is a reaction to the breathless hype from people who saw it showing up in early Leopard seeds and immediately started talking about how ZFS cures cancer, etc., and leapt to the conclusion that it'd be the default Leopard FS. My favorite was all the stuff claiming ZFS would be necessary to do Time Machine.

    The current AppleInsider story still has some of that feel; they're way overselling the prospect of ZFS becoming useful to end users in the short to mid term.

  • Travis Butler said...

    This comment has been removed by the author.

  • Anonymous said...

    @ the "who doesn't use the Apple Save Dialog" crowd

    Seriously? I have no less than a dozen programs in my dock right now that do not use the standard Apple save dialog including (as mentioned by others) nearly all Adobe software.

    @ the "But your aunt might confuse file.txt and file .txt too" crowd

    Umm, so? Aside from being a classic logical fallacy, are you really arguing that if you can't come up with a 100%, completely infallible solution, then you shouldn't try to make things easier for the user?

    Also, there's a key difference there. R and r are THE SAME LETTER. They may not be to a computer or typographers, but in the minds of the vast majority of people they are the same thing.

    But let's flip it around. What exactly is so beneficial about case-sensitivity aside from performance (which, as the original article states, is plentiful)?

  • Anonymous said...

    Also, horror of horrors, there are all of those UNIX/X11 apps out there that run on Mac OS X and don’t use the Mac OS X Save dialog.

  • LKM said...

    @Drew Thaler: UI metaphor != implementation. While Time Machine looks similar to ZFS's snapshots, it's actually something very different. As I've said, one is a local state, the other one is backup.

  • Anonymous said...

    Hi. I'm another storage & file system weenie, having worked on SCSI, RAID, FibreChannel & FireWire at Adaptec, UFS and QFS at Sun, and currently on a distributed file system.

    I think the MWJ article is actually quite good, and covers a number of reasons why Apple can't readily adopt ZFS now, or without major changes.

    ZFS has some good ideas; being able to check data integrity, for one. There are also advantages in bringing the file system and volume manager closer together, though I personally believe ZFS goes a bit too far here.

    That said, ZFS has some limitations. One of the key points, and one which is likely to be a show-stopper for Apple, is that it does not have a file system checker. If its data structures are damaged, it simply panics. If you have one damaged block on your disk, or you encounter a bug which writes out a single bad byte, the whole disk may well become unreadable.

    Another probable show-stopper is that you can't remove a disk from ZFS (yet). If you accidentally type the wrong command ("zpool add"), your USB disk attached to your laptop instantly becomes a dongle without which you can't boot your machine.

    I believe Matt may well be correct about the overhead of ZFS vs. HFS on small files, but it may not be substantial in terms of storage space. An HFS+ catalog record is small compared to a ZFS dnode. More important, the copy-on-write nature of ZFS means that a simple 'touch' on an empty file on a ZFS file system on Solaris can result in 100+ I/O operations -- and in effectively random order on the disk.

    Incidentally, ZFS has even worse overhead when it comes to extended attributes. It inherited a design from Solaris UFS which presents extended attributes as whole files, in a hidden directory per file! Imagine that each time the Finder wanted to look up the file type of a file, or Spotlight wanted to see comments, it had to do open, read, close for each attribute. This works for the NTFS "named streams" model, but that's fundamentally different from an "extended attribute" model. Luckily (?), few parts of Mac OS X use extended attributes today....

    ZFS does not perform well (at present) in certain environments which require streaming read/write performance. This is partly because it's not extent-based and it doesn't try to write sequentially. Its maximum block size at present is 128KB, which means that in a 5-disk array, it writes only 32KB per I/O to each disk. This results in HUGE overhead on the storage side....

    CPU performance, too, is an issue for file systems. It's easy to say that CPUs are getting faster than disk; but it's not always a good tradeoff to simply use more CPU to try and reduce disk I/O. If your CPU is actually busy -- and we're getting better at writing applications which can overlap I/O with computation -- then using extra CPU in the file system is not a good idea. There are a lot of important applications (think of video processing) in which CPU is the bottleneck, rather than I/O.

    Two minor issues with comments --

    1) The oft-quoted CERN report indicates that the particular hardware they were studying is actually using very cheap hard drives, and that most of the problems they ran into were related to bugs in those drives' firmware. One could argue that anyone could run into this, but I happen to know that Apple, for one, does intensive qualification of the drives they include in their systems. (At least they did a few years ago. Hopefully they haven't stopped in the name of saving money.)

    2) Case-insensitivity is not in ZFS yet. The reference to Sun's ARC indicates that a project has been approved, but doesn't say anything about its current state, or even whether it's been halted.

    There are some great ideas in HFS+ and some great ideas in ZFS. (And there are some in QFS and the many other file systems out there, too.) We'll never have the "one true file system" because there are greatly differing I/O needs. I'm sure that Apple will be evaluating ZFS for its locally-attached storage needs, and perhaps one day we'll see a change from HFS+; but it certainly won't be to ZFS as it stands today.

    In my opinion, of course. ;-)

    -- Anton

  • Anonymous said...

    Re: Anton -

    You're right, ZFS does not have a fsck tool. Because it doesn't need one. The ZFS repair model is to repair the storage pool online.

    ZFS stores redundant copies of its data structures. The uberblock is replicated four times (well, actually, the current uberblock is replicated four times, and there are four copies of previous uberblocks as well). Directory and file inodes are replicated as well. If ZFS encounters a corrupted inode, it will retry using a different copy; it will not simply panic.

    Now having said that, yes, there are some cases where ZFS still panics. Those are bugs and are being worked on by Sun. ZFS is still a young filesystem (not even Sun uses it as the default fs for Solaris).
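
    For what it's worth, that redundancy is also under administrator control. A sketch, assuming a recent Solaris Express build with the 'copies' property (the pool and filesystem names are made up):

        # Store two copies of every block in this filesystem, even on a
        # single-disk pool. (ZFS already stores metadata redundantly.)
        zfs set copies=2 tank/important

        # Walk the pool, verify every checksum, repair whatever has a good
        # redundant copy, and report what was found.
        zpool scrub tank
        zpool status -v tank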

  • David Goldsmith said...

    This comment has been removed by the author.

  • David Goldsmith said...

    One of the coolest capabilities of ZFS is the ability not only to roll back state, but to save it and restore it. I blogged about it today at http://blogs.sun.com/openroad/entry/zfs_and_file_system_state

  • Drew Thaler said...

    Alnisa: You're hitting several problems at once: data corruption and the "big floppy" storage model. I'll make sure I cover both in my next post responding to MWJ's response.

    Anonymous: Yes, ARC should absolutely go into the UBC too. I'm actually not quite clear on the fundamental reasons why the ZFS cache is so closely interwoven — perhaps its internal-consistency logic, perhaps it lets it be smarter about caching hot data uncompressed, and leaving cool data compressed. Jeff, if you're reading this, that would make a nice blog entry.

    LKM: Frontend != backend. The frontend snapshot metaphor that the Time Machine UI uses could easily be upgraded to merge two backend data sources: ZFS snapshots and TM's own backups.

    Anton: There are definitely show-stoppers that keep it from being shipped right now, today. But zpool remove is going to show up, booting is going to happen, performance is being improved further. Like Derek said, in any filesystem panics are used to indicate a bug that needs fixing. Consistency is not a problem, modulo bugs: the chained checksums literally validate the entire tree. (Something you don't get with HFS+, btw.) Fuzzing attacks which create invalid data structures with valid checksums — which includes buggy external implementations that might corrupt portable drives — probably fall under the category of "mild to serious bugs" if they are handled with a panic instead of failing gracefully. They aren't necessarily showstoppers, though, since it's not particularly different from the status quo on Mac OS X today.

  • Louis Gerbarg said...

    I find MWJ's analysis on this to be of very poor quality. They wrote a terse and very snide piece with a bunch of incorrect info. They wrote a much longer response when Drew criticized them, yet they still are posting really bad data:

    They go into a ton of minutiae about things like fatzap without fully understanding how it works. That is a level of detail beyond what the discussion requires.

    They keep asserting that checksums on single-drive configs are worthless. They are not. A year ago I had a drive error and lost a block of my btree, which ruined a whole HFS partition. On ZFS any block can be ditto'ed so that even on a single drive there are two copies; by default the metadata is. That may not save a particular file, but it would probably prevent quite a few whole-partition failures. Even without ditto blocks, just knowing whether the data is good has value.

    They totally misconstrued Drew's points about the size of data and growth over time.

    They keep asserting that somehow HFS+ is a lot more modern than it is. The HFS+ implementation in Mac OS X is nothing short of phenomenal. But HFS+ really is an extension to HFS. HFS had some hard-coded limits based on the size of its data structures, so Apple increased the size of them, creating an incompatible on-disk format change. At the same time they made some other changes that they felt were good. But fundamentally the structure is the same as HFS was back when it was first implemented. They are so similar that HFS and HFS+ can be handled by the same FS module in the kernel. The basic design is over 20 years old. It is very good for what it is, but there are a number of things it just cannot do without incompatible format changes and a lot of work. Things which ZFS already does.

    I also used to work for Apple (performance tuning and mobility work). I am not a storage expert, but I have a very good grasp of various filesystem behaviors as they relate to disk I/O and drive throughput. While I absolutely think ZFS currently has some issues that make it inappropriate as a default FS at this time, the analysis MWJ has presented publicly is so unprofessional as to be laughable.

  • Anonymous said...

    Derek Moor wrote:
    >You're right, ZFS does not have a fsck tool. Because
    >it doesn't need one. The ZFS repair model is to
    >repair the storage pool online.

    Actually, the current ZFS repair model is to reformat your pool and reload from backup.

    Search for bug 6458218, for instance, on the OpenSolaris site, or search for 'assertion failed' on the zfs-discuss forum....

    -- Anton

  • James Robinson said...

    "Another probable show-stopper is that you can't remove a disk from ZFS (yet). If you accidentally type the wrong command ("zpool add"), your USB disk attached to your laptop instantly becomes a dongle without which you can't boot your machine."

    Did anyone else read this and imagine doing this on purpose? If your laptop won't boot without the USB dongle on your keychain, you have instant, cheap data security in case of theft.

    Just sayin'.

  • Anonymous said...

    Anton - I searched for the bug in question and got six hits.

    6458218 - Fixed
    2148249 - Fixed
    6527325 - Fixed
    6260386 - Closed
    6181791 - Closed
    6537415 - Closed

    So I'm not sure I understand your post.

  • Anonymous said...

    If HFS+ is truly case-insensitive, then why is it possible to create one file named "Straße" and one file named "STRASSE" in the same folder? Shouldn't these two names refer to the same file?

    Case insensitive filenames just create more confusion in non-english speaking places.

  • Travis Butler said...

    (Sorry if I sounded a bit peeved last post; I've been without 'net access since 9/20, and having to access from various public hotspots when I've had opportunity.)

    It's been a long time since I've done any native-code programming (think TML Pascal, if anyone remembers it :) ), but yes, I'm aware of the need to make tradeoffs; it's still pretty important in the database work I'm doing now.

    The problem I had with your argument is that it seems to belittle (especially with the crack about floppies) the importance of a tradeoff that is very front-and-center for me, one that I deal with daily, and one that you ignored completely in your reply (as it concentrated on speed and not disk space); as I said, my laptop hard drive is in a state of constant triage, trying to keep 4-6 gig of free working space for the operating system. The best solution I've been able to manage is one of the small bus-powered portable drives, as an archive where I can keep some of the datafiles I don't absolutely need to keep online at any given time, like my image archive. This is not a very good solution, though; the time I move something off into the archive is usually not long before I realize I want to get to it again, and a laptop is used in a lot of places where it's awkward to have an external drive hanging off by a short cable.

    (This use case is also why I'm wary of the storage pool concept; as others have noted in some of the follow-up articles to your post, it assumes that once you plug a drive into your computer, it stays there. This may be great for servers, but I do a lot of hot-swapping drives around between the desktop and the laptop, hooking the archive to the laptop, and carrying drives back and forth between the office. I used a system where the data on removable storage was added seamlessly to the 'data soup' when you plugged it in, back with the Newton; it was a royal PITA back then.)

    Moreover, I understand very well the need to look to the future in design - but I don't see any change to this issue in the foreseeable future. This has been a problem with the mass storage on any computer I've had going back to the 5 MB Bernoulli Box I had running on a Mac 512E, to the point where one of my standard maxims is 'The datastore will always expand to fill the space available to it.' So as far as I'm concerned, tradeoffs involving the use of significant amounts of disk space are not trivial, and I don't see them ever becoming trivial, in anything but a relative fashion. (25K may have been a huge tradeoff in the days of 400K floppies, but it's trivial today; however, we're not talking about 25K here, are we?)

    It's interesting to read Rick Schaut's comments on this in his blog entry on the Word 6.0 debacle. The impression I got is that they made all kinds of compromises to make a program that would run in a small amount of memory but really needed much more - IOW, designing for the future at the cost of the present - and this is the main reason Word 6 got such a reputation as a slow, bloated resource hog. (Not that there weren't other reasons people hated Word 6.) So that's the other corollary to my thesis - it's all very well and good to talk about laptops with banks of removable flash memory as their main storage, but when designing a system for release Today, you need to make it run well on Today's systems. I don't expect Leopard to run well on a 500mhz TiBook, but it should run decently on systems a couple of generations old and run like lightning on everything being sold today. Likewise, if ZFS is planned for Leopard's successor, it should absolutely run well on everything being sold today - not just run like lightning on whatever systems are being sold in ~two years.

  • Anonymous said...

    Derek: Re bug 6458218, see the comment from Matt Ahrens (current ZFS team lead) on how to recover from it:

    To recover from this situation, try running build 60 or later, and put
    'set zfs:zfs_recover=1' in /etc/system. This should allow you to read
    your pool again. (However, we can't recommend running in this state
    forever; you should backup and restore your pool ASAP.)

    Some bugs in ZFS have allowed the user to still read their data with kernel hacks, like this one. But because there is no fsck, the recovery mechanism is reformat-and-restore. There is no other way to get back to a consistent on-disk state.

    Other bugs simply require reformatting the pool and don't allow you to get your data back at all:

    http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg07796.html

    -- Anton

  • Drew Thaler said...

    Travis: No worries, I definitely understand. A commenter on another post pointed me at this post from Eric Kustarz: ZFS on a laptop? It's got some nifty points.

    As for reducing storage pressure on a self-contained system like a notebook:

    1. Compression. HFS+ doesn't have any built-in support for it. You can do it manually, but it's a lot nicer to just flip a switch to enable it automatically. You may not want to use compression permanently, but it can help relieve pressure when you get tight.

    2. Laptop pools. I mentioned this in a comment above: the future doesn't have to be like the past. With flash getting more popular and larger capacity, a future laptop could use multiple flash drives instead of one big drive. Let's hypothesize two things that don't exist yet, but will probably exist soon: (1) something like zpool remove that migrates all data off an existing drive so that it can be removed, and (2) an ultralight which uses four 64GB flash drives = 256GB pool.

    Hey, look, the new 128GB flash drives just came out. If we have 64GB of free space, we can just zpool remove one of the 64s and add a 128 in its place. If we don't, we could temporarily add a 64GB external drive to hold the overflow. Need more space? Replace another 64 with a 128. And so on. Incremental hard disk upgrades! Pretty sweet. (There's a rough sketch of this flow at the end of this comment.)

    I don't think RAID-Z lends itself to this sort of setup very well, because afaik it requires identical drives. If there were an intermediate level between "unprotected pool" and "full RAID" which allowed asymmetric setups (without the same level of guaranteed protection, of course), you might even be able to hot-swap drives from smaller to larger.
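
    To sketch that upgrade flow concretely: the compression switch and zpool replace exist today, while the zpool remove step is exactly the piece that doesn't exist yet, so treat the last command as hypothetical. The device names are made up.

        # Relieve space pressure by compressing new writes.
        zfs set compression=on tank

        # Swap a full 64GB stick for a new 128GB one in place; ZFS
        # resilvers the data onto the new device automatically.
        zpool replace tank c1t0d0 c2t0d0

        # The missing piece, circa 2007: evacuating a data vdev entirely.
        # (Hypothetical; today you can't remove a data vdev from a pool.)
        zpool remove tank c1t1d0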

  • Drew Thaler said...

    By the way, that sort of incremental upgrading isn't sci-fi. A product called Drobo does this kind of live, incremental storage upgrading today.

  • Anonymous said...

    Drew wrote: "1. Compression. HFS+ doesn't have any built-in support for it. You can do it manually, but it's a lot nicer to just flip a switch to enable it automatically. You may not want to use compression permanently, but it can help relieve pressure when you get tight."

    Perhaps read-only files could be stored on a compressed disk image. Probably wouldn't work well for images, but would save space taken up by documentation and similar content.

    If the files are normally in specific locations on the disk, create symlinks from the expected location to the file on the image. Set up the image as something that is opened on login, and Bob's your uncle, unless I'm missing something fundamental about the implementation of compressed images that would make this scheme silly.