Sunday, December 02, 2007

The Case Against Insensitivity  

One of the most controversial parts of my earlier post, Don't Be a ZFS Hater, was when I mentioned off-handedly in the comments that I don't like case-insensitivity in filesystems.

Boy, did that spur a storm of replies.

I resolved to not pollute the ZFS discussion with a discussion of case-insensitivity and promised to make a separate blog post about it. It took a while, but this is that post. I blame a busy work schedule and an even busier travel schedule. (Recently in the span of two weeks I was in California, Ohio, London, Liverpool, London, Bristol, London, Amsterdam, London, then back to Ohio. Phew!)

Here's Why Case-Insensitive Filesystems Are Bad

I've worked in and around filesystems for most of my career; if not in the filesystem itself then usually a layer just above or below it. I'm speaking from experience when I tell you:

Case-insensitivity is a bad idea in filesystems.

And here's why:

  1. It's poorly defined.
  2. Every filesystem does it differently.
  3. Case-insensitivity is a layering violation.
  4. Case-insensitivity forces layering violations upon other code.
  5. Case-insensitivity is contagious.
  6. Case-insensitivity adds complexity and provides no actual benefit.

I'll expand on each of these below.

It's poorly defined

When I say "case-insensitive", what does that mean to you?

If you only speak one language and that language is English, it probably seems perfectly reasonable: map the letter a to A, b to B, and so on through z to Z. There, you're done. What was so hard about that?

But that's ASCII thinking; the world left that behind a long time ago. Modern systems are expected to deal with case differences in all sorts of languages. Instead of a simple 26-letter transformation, "case insensitivity" really means handling all the other alphabets too.

The problem with doing that, however, is that it brings language and orthography into the picture. And human languages are inherently vague, large, messy, and constantly evolving.

Can you make a strict definition of "case insensitivity" without any hand-waving?

One way to do it is with an equivalence table: start listing all the characters that are equal to other characters. We can go through all the variants of Latin alphabets, including a huge list of accents: acute, grave, circumflex, umlaut, tilde, cedilla, macron, breve, dot, ring, ogonek, hacek, and bar. Don't forget to find all the special ligatures and other letters, too, such as Æ vs æ and Ø vs ø.

Okay, our table is pretty big so far. Now let's start adding in other alphabets with case: Greek, Armenian, and the Cyrillic alphabets. And don't forget the more obscure ones, like Coptic. Phew. It's getting pretty big.

Did we miss any? Well, for any given version of the Unicode standard it's always possible to enumerate all letters, so it's certainly possible to do all the legwork and prove that we've got all the case mappings for, say, Unicode 5.0.0 which is the latest at the time of this writing. But Unicode is an evolving standard and new characters are added frequently. Every time a new script with case is added we'll need to update our table.

There are also some other hard questions for case insensitivity:

  • Digraph characters may have three equivalent mappings, depending on how they are being written: all-lowercase, all-uppercase, or title-case. (For example: dz, DZ, or Dz.) But this breaks some case-mapping tables which didn't anticipate the need for an N-way equivalence.

  • The German letter ß is considered equal to lowercase ss. Should "Straße" and "STRASSE" be considered equivalent? They are in German. But this breaks some case-mapping tables which didn't anticipate the need for an N-to-M character translation (1:2, in this case).

  • Capital letters can significantly alter the meaning of a word or phrase. In German, capital letters indicate nouns, so the word Essen means "food", while the word essen means "to eat". We make similar distinctions in English between proper nouns and regular nouns: God vs god, China vs china, Turkey vs turkey, and so on. Should "essen" and "Essen", or "china" and "China" really be considered equivalent?

  • Some Hebrew letters use different forms when at the end of a word, such as פ vs ף, or נ vs ן. Are these equivalent?

  • In Georgian, people recently experimented with using an obsolete alphabet called Asomtavruli to reintroduce capital letters to the written language. What if this had caught on?

  • What about any future characters which are not present in the current version of the Unicode standard?

Case is a concept that is built into written languages. And human language is inherently messy. This means that case-insensitivity is always going to be poorly defined, no matter how hard we try.
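To make a few of these edge cases concrete, here is a small sketch in Python. It uses Python's built-in Unicode case mappings purely as an illustration; none of the filesystems discussed in this post is claimed to use these exact rules. It also includes the Turkish dotless i, which comes up again later in the post.

```python
import unicodedata

# Python's case tables are frozen to one Unicode version per interpreter
# release, just as a filesystem's table is frozen when it ships.
print("Unicode tables in this interpreter:", unicodedata.unidata_version)

# 1:2 mapping: uppercasing the German sharp s produces two letters.
assert "Straße".upper() == "STRASSE"
# A naive lowercase comparison therefore misses the match ...
assert "Straße".lower() != "STRASSE".lower()
# ... while full Unicode case folding treats the two spellings as equal.
assert "Straße".casefold() == "STRASSE".casefold()

# Three-way digraph: U+01F3 (small), U+01F1 (capital), U+01F2 (titlecase).
small, capital, title = "\u01f3", "\u01f1", "\u01f2"
assert small.upper() == capital
assert small.title() == title
assert capital.lower() == title.lower() == small

# Turkish dotless i: a locale-blind round trip changes the letter.
dotless = "\u0131"
assert dotless.upper() == "I"
assert dotless.upper().lower() == "i"  # not the dotless i we started with
```

Each assertion is a place where "just map a to A" stops being enough: the table needs 1:2 mappings, three-way equivalence classes, and language-specific rules.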

Every filesystem does it differently

Unfortunately, filesystems can't engage in hand-waving. Filesystem data must be persistent and forward-compatible. People expect that the data they wrote to a disk last year should still be readable this year, even if they've had an operating system upgrade.

That's a perfectly reasonable expectation. But it means that the on-disk filesystem specification needs to freeze and stop changing when it's released to the world.

Because our notion of what exactly "case-insensitive" means has changed over the past twenty years, however, we've seen a number of different methods of case-insensitivity emerge.

Here are a handful of the most popular case-insensitive filesystems and how they handle case-mapping:

  • FAT-32: Stores both upper- and lower-case ASCII letters, but treats a-z and A-Z as identical. High-ASCII characters are interpreted through IBM code pages, which vary.
  • HFS: Stores both upper- and lower-case ASCII letters, but treats a-z and A-Z as identical. High-ASCII characters are interpreted through Mac encodings, which vary.
  • NTFS: Case-insensitive in different ways depending on the version of Windows that created the volume.
  • HFS+: Case-insensitive with a mapping table which was frozen circa 1996, and thus lacks case mappings for any newer characters.

None of these — except for NTFS created by Vista — are actually up-to-date with the current Unicode specification. That's because they all predate it. Similarly, if a new filesystem were to introduce case-insensitivity today, it would be locked into, say, Unicode 5.0.0's case mappings. And that would be all well and good until Unicode 5.1.0 came along.

The history of filesystems is littered with broken historical case mappings like a trail of tears.

Case-insensitivity is a layering violation

When people argue for case-insensitivity in the filesystem, they almost always give user interface reasons for it. (The only other arguments I've seen are based on contagion, which I'll talk about in a moment.) Here is the canonical example:

My Aunt Tillie doesn't know the difference between letter.txt and Letter.txt. The filesystem should help her out.

But in fact this is a UI problem. The problem relates to the display and management of information, not the storage of this information.

Don't believe me?

  • When any application displays items in a window, who sorts them case-insensitively? The filesystem? No! The application does it.

  • When you type-select, typing b-a-b-y to select the folder "Baby Pictures" in an application, who does the case-insensitive mapping of the letters you type to the files you select? The filesystem? No! The application again.

  • When you save or copy files, who does the case-insensitive test to warn you if you're creating "file.txt" when "File.txt" already exists? The filesystem? Yes!

Why does the third question have a different answer than the rest?

And we've already talked about how filesystems are chronically out-of-date with their case mappings. If your aunt is a Turkish Mac user, for example, she's probably going to notice that the behavior of the third one is different for no good reason. Why are you confusing your Aunt Tülay?
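As a rough illustration of the first two bullets, here is how an application can do case-insensitive sorting and type-select entirely on its own, with no help from the filesystem. The function names are hypothetical, and Python's str.casefold stands in for whatever comparison rule a real UI toolkit would use.

```python
def sorted_for_display(names):
    """Sort names the way a file browser window might: ignoring case."""
    return sorted(names, key=str.casefold)


def type_select(names, typed):
    """Return the first displayed name that starts with the typed letters."""
    prefix = typed.casefold()
    for name in sorted_for_display(names):
        if name.casefold().startswith(prefix):
            return name
    return None


folder = ["Zebra.txt", "Baby Pictures", "apple.doc"]
print(sorted_for_display(folder))
print(type_select(folder, "baby"))
```

Because this lives in the application, its rules can be updated or localized at any time without touching a single byte on disk.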

One last point was summarized nicely by Mike Ash in the comments of Don't Be a ZFS Hater. I'll just quote him wholesale here:

Yes, Aunt Tillie will think that "Muffin Recipe.rtf" and "muffin recipe.rtf" ought to be the same file. But you know what? She'll also think that "Muffin Recipe .rtf" and "Recipe for Muffins.rtf" and "Mufin Recipe.txt" ought to be the same file too.

Users already don't generally understand how the OS decides whether two files are the same or not. Trying to alleviate this problem by mapping names with different case to the same file solves only 1% of the problem and just isn't worth the effort.

I agree completely.

Case-insensitivity forces layering violations upon other code

All too often, pieces of code around the system are required to hard-code knowledge about case-insensitive filesystem behavior. Here are a few examples off the top of my head:

  • Collision prediction. An application may need to know if two files would conflict before it actually writes either of them to disk. If you are writing an application where a user creates a group of documents — a web page editor, perhaps — you may need to know when banana.jpg and BANANA.JPG will conflict.

    The most common way that programmers solve this is by hard-coding some knowledge about the case-insensitivity of the filesystem in their code. That's a classic layering violation.

  • Filename hashing. If you are writing code to hash strings that are filenames, you probably want equivalent paths to generate the same hash. But it's impossible to know which files are equivalent unless you know the filesystem's rules for case-mapping.

    Again, the most common solution is a layering violation. You either hard-code some knowledge about the case-insensitivity tables, or you hard-code some knowledge about your input data. (For example, you may just require that you'll never, never, ever have multiple access paths for the same file in your input data. Like all layering violations, that might work wonderfully for a while ... right up until the day that it fails miserably.)

I'm sure there are more examples out there.
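Here is a minimal sketch of what that hard-coded knowledge tends to look like in practice. The helper names are made up for illustration, and the default fold argument (Python's str.casefold) is only a guess at the filesystem's rules, which is exactly the problem.

```python
import hashlib
from collections import defaultdict


def predict_collisions(names, fold=str.casefold):
    """Group names that an assumed case-insensitive volume would merge.

    `fold` is the application's guess at the filesystem's case mapping.
    """
    groups = defaultdict(list)
    for name in names:
        groups[fold(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


def path_key(path, fold=str.casefold):
    """Hash a path so that (assumed) equivalent spellings share one key."""
    return hashlib.sha1(fold(path).encode("utf-8")).hexdigest()


print(predict_collisions(["banana.jpg", "BANANA.JPG", "apple.png"]))
```

If the volume's real table differs from the guess (an older HFS+ table, a different NTFS version), both helpers quietly give wrong answers.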

Case-insensitivity is contagious

This is the worst part. It's all too easy to accidentally introduce a dependence on case-insensitivity: just use an incorrect path with bad case.

The moment somebody creates an application or other system that inadvertently depends on case-insensitivity, it forces people to use a case-insensitive filesystem if they want to use that app or system. And that's one of the major reasons why case-insensitivity has stuck around — because it's historically been very difficult to get rid of.

I've seen this happen with:

  • Source code. Some bozo writes #include "utils.h" when the file is named Utils.h. Sounds innocent enough, until you find that it's repeated dozens of times across hundreds of files. Now that project can only ever be compiled on a case-insensitive filesystem.

  • Game assets. A game tries to load lipsync.dat instead of LIPSYNC.DAT. Without knowing it, the artist or developer has accidentally locked that game so that it can only run on a case-insensitive filesystem. (This causes real, constant problems in game pipelines; teams create and test their games on case-insensitive NTFS and don't notice such problems until the game is burned to a case-sensitive UDF filesystem on DVD or Blu-ray.)

  • Application libraries. DLLs and shared library references are sometimes generated by a build script which uses the wrong case. When that happens, the application may simply fail to launch from a case-sensitive filesystem.

  • Miscellaneous data files. Sometimes an application will appear to run on a case-sensitive filesystem but some feature will fail to work because it fails to load a critical data file: the spell-checking dictionary, a required font, a nib, you name it.
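One way to catch these mistakes while still developing on a case-insensitive volume is to compare each path component against the directory listing, which returns names in their stored case. This is a sketch only; has_exact_case is a hypothetical helper, not part of any build tool mentioned here.

```python
import os


def has_exact_case(root, relative_path, listdir=os.listdir):
    """Return True only if every component matches the stored name exactly.

    A plain existence check would succeed for "utils.h" on a
    case-insensitive volume even when the file is really "Utils.h".
    """
    current = root
    for part in relative_path.split("/"):
        try:
            entries = listdir(current)
        except OSError:
            return False  # missing directory, or a file used as a directory
        if part not in entries:
            return False
        current = os.path.join(current, part)
    return True
```

A build script or asset pipeline could run a check like this over every include and asset reference, and fail the build on the first mismatch.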

Happily, since Mac OS X shipped in 2001, Apple has been busy solving its own problems with case-insensitivity and encouraging its developers to test with case-sensitive filesystems. Two important initiatives in this direction have been NFS home directories and case-sensitive HFSX.

The upshot of it is that Mac OS X is actually very friendly to case-sensitive disks these days; very little that's bad happens when you use case-sensitive HFSX today.

Case-insensitivity adds complexity with no actual benefit

I'm going to make an assertion here:

ONE HUNDRED PERCENT of the path lookups happening on your Mac right now are made with correct case.

Think about that for a moment.

First off, you may think this contradicts the point I just made in the previous section. Nope; I'm simply rounding. The actual figure is something like 99.999%, and I'd probably get tired of typing 9's before I actually approached the real number. There are infinitesimally few path accesses made with incorrect case compared to the ones that are made with the proper case.

Modern computers make hundreds of filesystem accesses per second. As I type this single sentence in MarsEdit on Mac OS X 10.4.11, my computer has made 3692 filesystem accesses by path. (Yes, really. MarsEdit's "Preview" window is invoking Perl to run Markdown, which loads a handful of modules, and then WebKit re-renders the page. That's a lot of it, but meanwhile there's background activity from Mail, Activity Monitor, iChat, SystemUIServer, iCalAlarmScheduler, AirPort Base Station Agent, Radioshift, NetNewsWire, Twitterrific, and Safari.)

Under Mac OS X you can measure it yourself with this command in Terminal:

  sudo fs_usage -f filesys | grep / > /tmp/accesses.txt

The vast majority of file accesses are made with paths that were returned from the filesystem itself: some bit of code read the contents of a directory, and passed the results on to another bit of code, which eventually decided to access one of those files. So most of the time the filesystem is getting back the paths that it has returned earlier. Very very few accesses are made with paths that come directly from an error-prone human, which is why essentially 100% of filesystem accesses are made with correct case.

But if essentially all filesystem accesses are made with the correct case to begin with, why do we even have case-insensitivity at all?

We've already discussed the problems of contagion, which is a circular justification: we have to do it because someone else did it first. We've also discussed UI decisions being incorrectly implemented in the bottommost layer of the operating system. Other than those two, what good is it?

I don't have an answer to that. For the life of me I can't come up with any reason to justify case-insensitive filesystems from a pure design standpoint. That leads me to my closing argument, which is...

A thought experiment

Suppose case-insensitive filesystems had never been invented. You're the leader of a team of engineers in charge of XYZZYFS, the next big thing in filesystems. One day you tell the other people who work on it:

"Hey! I've got this great idea! It's called case-insensitivity. We'll take every path that comes into the filesystem and compare it against a huge table to create a case-folded version of the path which we'll use for comparisons and sorting. This will add a bunch of complexity to the code, slow down all path lookups, increase our RAM footprint, make it more difficult for users of our filesystem to handle paths, and create a compatibility nightmare for future versions if we ever decide to change the table. But, you see, it'll all be worth it, because... _________________."

Can you fill in the blank?

Wednesday, October 10, 2007

ZFS Hater Redux  

MWJ has responded to my last post, Don't Be a ZFS Hater, with a post of their own: You don't have to hate ZFS to know it's wrong for you.

I don't like the point-by-point quote and response format — it's way too much like an old-school Usenet flamewar. So I will simply try to hit the high points of their arguments.

Where we agree

  • ZFS is not ready to deploy to the entire Mac OS X user base today. There's still some work to be done.

  • ZFS isn't necessary for most of today's Macintosh computers. If you have been using your Mac with no storage-related problems, then you can keep on using it that way. Perform regular backups and you'll be just fine.

  • It would be an absolutely terrible idea to take people's perfectly working HFS+ installations on existing computers and forcibly convert them to ZFS, chuckling evilly all the while. Not quite sure where that strawman came from.

  • ZFS fatzaps are expensive for small files. If it were true that 20% of the files in a Mac OS X installation required a fatzap (pdf link to ZFS-on-disk specification), that would indeed be unnecessarily wasteful.

  • A typical Mac OS X 10.4.x installation has on the order of 600,000 files.

I think that's about it. But of course there are a number of places where we disagree too.

ZFS would be awfully nice for a small segment of the Mac OS X user base if it were ready today.

If you spend any amount of time managing storage — if drives have gone bad on you, if you have ever run out of space on a desktop system and needed to add a drive (or two), if you have a RAID array — then you are the sort of user that could see some immediate benefit.

But of course as we already agreed, it's not ready today. You haven't been "cheated" and I'm sure you don't feel that way. But feel free to look forward to it: I sure am.

ZFS — or something with all the features of ZFS — will be more than nice, it will be necessary for tomorrow's Macintosh computers.

Both storage sizes and consumer consumption of storage grow exponentially. I tried to make this point last time, but MWJ seems to have misunderstood and accused me of misquoting. Let's try again.

In 1997, 20GB of storage meant a server RAID array. Ten years later, in 2007, 20GB of storage is considered "not enough" by most people. Across my entire household I have drives larger than that in my computer, in my TiVo, in my PlayStation 3, and even in my iPod. Now let's extrapolate that into the future.

In 2007, 20TB of storage means a server RAID array. Ten years from now, in 2017, 20TB of storage will similarly be considered "not enough". MWJ scoffed at ZFS because it's really pretty good at the problems of large storage. But you know what? A solution to managing that much data will need to be in place in Mac OS X well before 20TB drives become the norm. Better hope someone's working on it today.
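For anyone who wants to check the extrapolation, the arithmetic is short. This sketch uses only the two data points given above (20GB in 1997, 20TB in 2007) and assumes decimal units.

```python
import math

start_gb = 20          # "server RAID array" in 1997
end_gb = 20 * 1000     # 20TB, the 2007 equivalent, in decimal GB
years = 10

# Constant annual growth factor that turns 20GB into 20TB in ten years.
annual_factor = (end_gb / start_gb) ** (1 / years)
doubling_time_years = math.log(2) / math.log(annual_factor)

print(f"growth per year: x{annual_factor:.2f}")
print(f"doubling time: {doubling_time_years:.2f} years")
```

A thousandfold increase in a decade works out to storage roughly doubling every year, which is why another decade at the same pace turns 20TB into "not enough".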

Meanwhile — and this is what scares the pants off me — the reliability numbers for hard drives have improved much more slowly than capacity.

Here's a fairly typical Seagate drive with a capacity of ~150GB = ~1.2 x 10^12 bits. The recoverable error rate is listed as 10 bits per 10^12 bits. Let's put those numbers together. That means that if you read the entire surface of the disk, you'll typically get twelve bits back that are wrong and which a retry could have fixed. (Updated Oct 11 2007: In the comments, Anton corrected me: I should've used the unrecoverable error rate here, not the recoverable error rate. The net result is that in ideal operating conditions bit errors occur over 100x less frequently than I originally suggested. However, it's still not zero. The net result is still a looming problem when you scale it across (installed base) x (storage consumption) x (time). See the comment thread.)
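The arithmetic behind "twelve bits" can be reproduced directly from the quoted figures. The last step applies the "over 100x less frequently" correction from the update as an upper bound; the exact unrecoverable rate is not given here, so that number is only a ceiling.

```python
capacity_bits = 150 * 10**9 * 8    # ~150GB drive, about 1.2 x 10^12 bits
errors, per_bits = 10, 10**12      # quoted recoverable error rate

# Expected recoverable bit errors when reading the whole surface once.
recoverable_per_full_read = capacity_bits * errors // per_bits

# The update says unrecoverable errors are over 100x rarer, so this is
# an upper bound rather than a measured figure.
unrecoverable_upper_bound = recoverable_per_full_read / 100

print(recoverable_per_full_read)   # 12
print(unrecoverable_upper_bound)   # 0.12
```

Even at that lower rate, multiplying by every drive in the installed base and every full read over a drive's lifetime gives a number that is nowhere near zero.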

Yes, really. Did you catch the implications of that? Silent single-bit errors are happening today. They happen much more often at high-end capacities and utilizations, and we often get lucky because some types of data (video, audio, etc) are resistant to that kind of single-bit error. But today's high end is tomorrow's medium end, and the day after tomorrow's low end. This problem is only going to get worse.

Worse, bit errors are cumulative. If you read and get a bit error, you might wind up writing it back out to disk too. Oops! Now that bit error just went from transient to permanent.

Still think end-to-end data integrity isn't worth it?

Apple using ZFS rather than writing their own is a smart choice.

As I hope I made abundantly clear in the last post, extending HFS+ to the future that we can see looming is just not an option — its structure is simply too far removed from these problems. It's really just not worth it. It's pretty awesome that the original HFS design scaled as far as it did: how many people can come up with a 20-year filesystem? But you have to know when to throw in the towel.

So if you accept that the things I described above are real, looming problems, then Apple really does need a filesystem with at least several of the more important attributes of ZFS.

The choices at this point are essentially twofold: (1) start completely from scratch, or (2) use ZFS. There's really no point in starting over. ZFS has a usable license and has been under development for at least five years by now. By the time you started over and burned five years on catching up it would be too late.

And I really do want to reiterate that the shared community of engineers from Apple, Sun, and FreeBSD working on ZFS is a real and measurable benefit. I've heard as much from friends in CoreOS. I can't understand the hostility to this very clear and obvious fact. It's as if Apple suddenly doubled or tripled the number of filesystem engineers it has available, snagging some really brilliant guys at the top of their profession in the process, and then multiplied its testing force by a factor of 10.

(To respond to a query voiced by MWJ, HFS+ never gathered that community when it was open-sourced because the design was already quite old at that point. It frankly didn't have anything new and exciting to offer, and it was saddled with performance problems and historical compromises of various kinds, so very few people were interested in it.)

ZFS fatzaps are unlikely to be a significant problem.

This gets a bit technical. Please skip this section if you don't care about this level of detail.

MWJ really pounded on this one. That was a bit weird to me, since it seemed to be suggesting that Apple would not expend any engineering effort on solving any obvious glaring problems with ZFS before releasing it. That's not the Apple I know.

But okay, let's suppose that we're stuck with ZFS and Mac OS X both frozen as they stand today. Let's try to make an a priori prediction of the actual cost of ZFS fatzaps on a typical Mac OS X system.

  • Classic HFS attributes (FinderInfo, ExtendedFinderInfo, etc) are largely unnecessary and unused today because the Finder uses .DS_Store files instead. In the few cases where these attributes are set and used by legacy code, they should fit easily in a small number of microzaps.

  • Extended attributes may create fatzaps. Today it seems like extended attributes are typically used on large files: disk images, digital photos, etc. This may provoke squawking from the peanut gallery, but once a file is above a certain size — roughly a couple of megabytes — using an extra 128KiB is negligible. If you have a 4MiB file and you add 128KiB to track its attributes, big deal: you've added 3%. It's not nothing, but it's hardly a significant problem.

  • Another likely source of fatzaps in ZFS on Mac OS X is the resource fork. But with Classic gone, new Macs ship with virtually no resource forks on disk. There are none in the BSD subsystem. There are a handful in /System and /Library, mostly fonts. The biggest culprits are large old applications like Quicken and Microsoft Office. A quick measurement on my heavily-used one-year-old laptop shows that I have exactly 1877 resource forks out of 722210 files — that's 0.2%, not 20%.

    (Fun fact: The space that would be consumed by fatzap headers for these resource files comes out to just 235 MiB, or roughly six and a half Keyboard Software Updates. Again: not nothing, but hardly a crisis to scream about.)
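The three figures in the bullets above come from simple arithmetic, sketched here with the numbers quoted in the text and the assumption of one 128KiB fatzap per affected file.

```python
KIB = 1024
MIB = 1024 * KIB

fatzap_bytes = 128 * KIB

# Overhead of one fatzap on a 4MiB file.
overhead_on_4mib_file = fatzap_bytes / (4 * MIB)

# Share of files with a resource fork on the laptop measured above.
resource_forks = 1877
total_files = 722210
fork_fraction = resource_forks / total_files

# Total space if every resource fork costs one fatzap.
total_fatzap_mib = resource_forks * fatzap_bytes / MIB

print(f"{overhead_on_4mib_file:.1%} overhead on a 4MiB file")
print(f"{fork_fraction:.2%} of files have a resource fork")
print(f"{total_fatzap_mib:.1f} MiB of fatzap headers in total")
```

The fork share comes out at about 0.26%, which the text rounds down to 0.2%, and the total lands just under 235 MiB.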

Want to measure it yourself? Amit Singh's excellent hfsdebug utility will show you a quick summary. Just run "sudo hfsdebug -s" and look at the numbers for "files" and "non-zero resource forks". Or try "sudo hfsdebug -b attributes -l any | less" to examine the files which have extended attributes on your disk.

ZFS snapshots don't have to be wasteful

The cheesesteak analogy was cute. But rather than imagining that snapshots just eat and eat and eat storage until you choke in a greasy pile of death, it would help if we all understood how hard drive storage is actually used in practice, and how ZFS can work with that.

There are three major classes of stored data.

  • Static data is data that you want to keep and almost never modify. This is your archive. Photographs, music, digital video, applications, email, etc. Archives are additive: unless you really run out of room, you rarely delete the old — you only add new stuff. You want the contents safe and immediately accessible, but they are essentially unchanging.

Snapshotting static data is close enough to free that you won't notice: the only cost is the basic cost of the snapshot. No extraneous data copies are ever created, because you never modify or delete this stuff anyway.

  • Dynamic data is data that you want to keep, but are modifying with some frequency. This is whatever you are working on at the moment. It might be writing a novel, working in Photoshop, or writing code: in all cases you keep saving new versions over the old.

Snapshotting dynamic data is more expensive, because if you do it too much without recycling your old snapshots then you can build up a large backlog.

  • Transient data is data that should not be persistent at all. These are your temporary files: local caches, scratch files, compiler object files, downloaded zip files or disk images, etc. These may be created, modified, or deleted at any moment.

Snapshotting transient data is generally a bad idea — by definition you don't care that much about it and you'd prefer it to be deleted immediately.

Got all that? Okay. Now I need to make a couple of points.

First, I assert that virtually all of the data on personal computer hard drives is static most of the time. Think about that. The operating system is static the whole time you are using it, until you install a system update. (And even then, usually just a few hundred megabytes change out of several gigabytes.) Your /Applications folder is static. Your music is static. And so on. Usually a few percent of your data is dynamic, and a few more percent is transient. But in most cases well over 95% is static. (Exceptions are easy to come up with: Sometimes you generate a large amount of transient data while building a disk image in iDVD or importing DV footage. That can shift the ratio below 95%. But once that task is complete you're back to the original ratio.)

Second, the biggest distinction that matters when snapshotting is separating persistent data from transient data. Taking snapshots of transient data is what will waste disk space in a hurry. Taking snapshots of dynamic data as a local backup is often valuable enough that it's okay to burn the small amount of disk space that it takes, because remember: that's the actual data that you're actively working on. And as we already mentioned, snapshots of static data are free.

Now here's where it gets interesting.

With ZFS, snapshots work on the filesystem level. Because it no longer uses the "big floppy" model of storage, new filesystems are very cheap to create. (They are almost as lightweight as directories, and often used to replace them.) So let's create one or more special filesystems just for transient data and exclude them from our regular snapshot process. In fact on Mac OS X that's easy: we have well-defined directories for transient data: ~/Library/Caches, /tmp, and so on. Link those all off to one or more transient filesystems and they will never wind up in a snapshot of the important stuff. I wouldn't expect users to do this for themselves, of course — but it could certainly be set up that way automatically by Apple.

Once the transient data is out of the picture, our snapshots will consist of 95% or more static data — which is not copied in any way — and a tiny percentage of dynamic data. And remember, the dynamic data is not even copied unless and until it changes. The net effect is very similar to doing an incremental backup of exactly and only the files you are working on. This is essentially a perfect local backup: no duplication except where it's actually needed.

Will you want to allow snapshots to live forever? Of course not. One reasonable model for taking backup snapshots might be to remember 12 hourly snapshots, 7 daily snapshots, and 4 weekly snapshots. If you are getting tight on storage the system could take new snapshots less frequently and expire them more aggressively. Remember: when nothing is changing the snapshots don't take up any space.
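A retention policy like that is easy to sketch. The code below is a hypothetical illustration in Python, not anything ZFS or Mac OS X ships: it assumes a snapshot is identified by its creation time, and it keeps the newest snapshot in each of the most recent 12 hours, 7 days, and 4 ISO weeks that contain one.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(timestamps, hourly=12, daily=7, weekly=4):
    """Return the set of snapshot times that the policy retains."""
    rules = [
        (hourly, lambda t: (t.year, t.month, t.day, t.hour)),
        (daily, lambda t: (t.year, t.month, t.day)),
        (weekly, lambda t: tuple(t.isocalendar())[:2]),  # (ISO year, week)
    ]
    newest_first = sorted(timestamps, reverse=True)
    keep = set()
    for limit, bucket_of in rules:
        seen = []
        for t in newest_first:
            bucket = bucket_of(t)
            if bucket in seen:
                continue          # a newer snapshot already covers this bucket
            if len(seen) == limit:
                break             # enough buckets for this rule
            seen.append(bucket)
            keep.add(t)
    return keep


# Thirty days of hourly snapshots, starting Monday, October 1, 2007.
start = datetime(2007, 10, 1)
history = [start + timedelta(hours=h) for h in range(30 * 24)]
kept = snapshots_to_keep(history)
print(len(history), "snapshots taken,", len(kept), "retained")
```

Everything not in the returned set can be expired. Under storage pressure, the system could simply call this with smaller limits.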

Wrap-up: Listen to the smart guys

Some very smart people at Sun started the ball rolling by putting an awful lot of thought into the future of storage, and they came up with ZFS.

After they announced it and started talking about it, other brilliant people at Apple (and FreeBSD, and NetBSD) paid attention to what they were doing. And they listened, and thought about it, and looked at the code, and wound up coming around to the side of ZFS as well.

If you think I'm smart, just know that I'm in awe of some of the guys who've been involved with this project.

If you think I'm stupid, why, I look forward to hearing from you in the comments.

Saturday, October 06, 2007

Don't be a ZFS Hater  

John Gruber recently linked to (and thus gave credibility to) an MWJ post ripping on a fairly reasonable AppleInsider post about ZFS. Representative quote:

“We don't find HFS Plus administration to be complex, and we can't tell you what those other things mean, but they sound really cool, and therefore we want them. On the magic unlocked iPhone. For free.”

Har har har. Wait. Hold on a minute. Why is it suddenly fashionable to bash on ZFS?

Part of it is a backlash to the weird and obviously fake rumor about it becoming the default in Leopard, I guess. (No thanks to Sun's CEO Jonathan Schwartz here, who as far as I know has never publicly said anything about why he either misspoke or misunderstood what was going on back in June.)

But don't do that. Don't be a ZFS hater.

A word about my background

Let's get the credentials out of the way up front. Today I work on a file I/O subsystem for PlayStation 3 games. Before that, I worked in Apple's CoreOS filesystems group. Before that, I worked on DiscRecording.framework, and singlehandedly created the content subframework that streamed out HFS+, ISO-9660, and Joliet filesystems. Before that, I worked on the same thing for Mac OS 9. And before that, I worked on mass storage drivers for external USB/FireWire drives and internal ATA/ATAPI/SCSI drives.

You might say I know a thing or two about filesystems and storage.

What bugged me about the article

  1. ZFS is a fine candidate to replace HFS+ eventually. It's not going to happen overnight, no. And it'll be available as an option for early adopters way before it becomes the default. But several years from now? Absolutely.

  2. The bizarre rants about ZFS wasting processor time and disk space. I'm sorry, I wasn't aware that we were still using 30MHz machines with 1.44MB floppies. ZFS is great specifically because it takes two things that modern computers tend to have a surplus of — CPU time and hard disk space — and borrows a bit of it in the name of data integrity and ease of use. This tradeoff made very little sense in, say, 1992. But here in 2007 it's brilliant.

  3. Sneeringly implying that HFS+ is sufficient. Sure, HFS+ administration is simple, but it's also inflexible. It locks you into what I call the "big floppy" model of storage. This only gets more and more painful as disks get bigger and bigger. Storage management has come a long way since the original HFS was created, and ZFS administration lets you do things that HFS+ can only dream of.

  4. Claiming that RAID-Z is required for checksums to be useful. This is flat-out wrong. Sure, RAID-Z helps a lot by storing an error-correcting code. But even without RAID-Z, simply recognizing that the data is bad gets you well down the road to recovering from an error — depending on the exact nature of the problem, a simple retry loop can in fact get you the right data the second or third time. And as soon as you know there is a problem you can mark the block as bad and aggressively copy it elsewhere to preserve it. I suppose the author would prefer that the filesystem silently returned bad data?

  5. Completely ignoring Moore's Law. How dumb do you need to be to willfully ignore the fact that the things that are bleeding-edge today will be commonplace tomorrow? Ten years ago, twenty gigabytes of disk meant a massive server array. Today I use a hard drive ten times bigger than that just to watch TV.
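The retry idea from point 4 is easy to sketch. What follows is a toy in Python, not how ZFS actually does it: `read_verified`, `read_block`, and `ChecksumError` are names made up for illustration, and SHA-256 is just a stand-in for whatever checksum the filesystem stored when the block was written.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when no attempt produced data matching the stored checksum."""


def read_verified(read_block, expected_digest, max_attempts=3):
    """Call read_block() until its data matches expected_digest.

    read_block      -- hypothetical zero-argument callable returning bytes
    expected_digest -- SHA-256 hex digest recorded when the block was written
    Returns (data, attempts_used); raises ChecksumError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        data = read_block()
        if hashlib.sha256(data).hexdigest() == expected_digest:
            return data, attempt
        # Mismatch: the data is known to be bad, so try the read again
        # instead of handing corrupt bytes back to the caller.
    raise ChecksumError(f"block still bad after {max_attempts} attempts")
```

The point is only that a checksum turns silent corruption into a detectable event. Once the mismatch is visible, the caller can retry, fall back to another copy, or at the very least report the failure.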

Reading this article made me feel like I was back in 1996 listening to people debate cooperative vs preemptive multitasking. In the Mac community at that time there were, I'm ashamed to say, a lot of heated discussions about how preemptive threading was unnecessary. There were some people (like me) who were clamoring for a preemptive scheduler, while others defended the status quo — claiming, among other things, that Mac OS 8's cooperative threads "weren't that bad" and were "fine if you used them correctly". Um, yeah.

Since then we've thoroughly settled that debate, of course. And if you know anything about technology you might be able to understand why there's a difference between "not that bad" and "a completely new paradigm".

ZFS is cool

Let's do a short rundown of reasons why I, a qualified filesystem and storage engineer, think that ZFS is cool. I'll leave out some of the more technical reasons and just try to keep it in plain English, with links for further reading.

  • Logical Volume Management. Hard disks are no longer big floppies. They are building blocks that you can just drop in to add storage to your system. Partitioning, formatting, migrating data from an old small drive to a new big drive: these all go away.

  • Adaptive Replacement Caching. ZFS uses a smarter cache eviction algorithm than Mac OS X's unified buffer cache (UBC), which lets it deal well with data that is streamed and only read once. (Sound familiar? It could eliminate the need for F_NOCACHE.)

  • Snapshots. Think about how drastically the trash can metaphor changed the way people worked with files. Snapshots are the same concept, extended system-wide. They can eliminate entire classes of problems.

    I don't know about you, but in the past year I have done all of the following, sometimes more than once. Snapshots would've made these a non-issue:

    • installed a software update and then found out it broke something
    • held off on installing a software update because I was afraid something might break
    • lost some work in between backups
    • accidentally deleted an entire directory with a mistyped rm -rf or SCM delete command

  • Copy-on-write in the filesystem makes snapshots super-cheap to implement. No, not "free", just "so cheap you wouldn't possibly notice". If you are grousing about wasted disk space, you don't understand how it works. Mac OS X uses copy-on-write extensively in its virtual memory system because it's both cheap and incredibly effective at reducing wasted memory. The same thing applies to the filesystem.

  • End-to-end data integrity. Journaling is the only thing HFS+ does to prevent data loss. This is hugely important, and big props are due to Dominic Giampaolo for hacking it in. But journaling only protects the write stage. Once the bits are on the disk, HFS+ simply assumes they're correct.

    But as disks get larger and cheaper, we're finding that this isn't sufficient any more. The odds of any one bit being wrong are very small. And yet the 200GB hard disk in my laptop has a capacity of about 1.6 trillion bits. The cumulative probability that EVERY SINGLE ONE of those bits is correct is effectively zero. Zero!

    Backups are one answer to this problem, but as your data set gets larger they get more and more expensive and slow. (How do you back up a terabyte's worth of data? How long does it take? Worse still, how do you really know that your backup actually worked instead of just appearing to work?) So far, disk capacity has consistently grown faster than disk speed, meaning that backups will only continue to get slower and slower. Boy, wouldn't it be great if the filesystem — which is the natural bottleneck for everything disk-related — helped you out a little more on this? ZFS does.

  • Combinations of the above. There are some pretty cool results that fall out of having all of these things together in one place. Even if HFS+ supported snapshots, you'd still be limited by the "big floppy" storage model. It really starts to get interesting when you combine snapshots with smart use of logical volume management. And we've already discussed how RAID-Z enhances ZFS's basic built-in end-to-end data integrity by adding stronger error correction. There are other cool combinations too. It all adds up to a whole which is greater than the sum of its parts.
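The arithmetic behind the end-to-end integrity point can be written down directly. This is a back-of-the-envelope sketch, assuming each bit fails independently with the same small probability. Both that independence assumption and the sample error rates are mine, chosen for illustration, and the function names are hypothetical.

```python
import math


def bits_in_gigabytes(gigabytes):
    """Convert decimal gigabytes (1 GB = 10**9 bytes) to bits."""
    return gigabytes * 10**9 * 8


def prob_all_bits_correct(n_bits, p_bit_error):
    """Probability that every one of n_bits is correct.

    Assumes independent errors, so the answer is (1 - p) ** n.
    log1p keeps the computation accurate when p is tiny.
    """
    return math.exp(n_bits * math.log1p(-p_bit_error))


if __name__ == "__main__":
    n = bits_in_gigabytes(200)  # the 200GB laptop disk: 1.6 trillion bits
    for rate in (1e-15, 1e-12, 1e-9):  # assumed rates, illustration only
        p_ok = prob_all_bits_correct(n, rate)
        print(f"p = {rate:g}: P(all bits correct) = {p_ok:.3g}")
```

The answer is very sensitive to the per-bit rate you plug in. But for any fixed rate, the probability of a fully clean disk can only go down as capacity goes up.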

Is any of this stuff new and unique to ZFS? Not really. Bits and pieces of everything I've mentioned above have shown up in many places.

What ZFS brings to the table is that it's the total package — everything all wrapped up in one place, already integrated, and in fact already shipping and working. If you happened to be looking for a next-generation filesystem, and Apple is, you wouldn't need to look much further than ZFS.
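If the claim that copy-on-write makes snapshots nearly free sounds like hand-waving, here is a toy model of it. This is a sketch under heavy assumptions, not ZFS's on-disk design: `CowVolume` and all of its methods are hypothetical, whole files stand in for blocks, and copying a small pointer table stands in for sharing a block tree.

```python
class CowVolume:
    """Toy copy-on-write volume: names point at blocks that are never rewritten."""

    def __init__(self):
        self._blocks = {}     # block id -> bytes, never modified in place
        self._next_id = 0
        self._live = {}       # file name -> block id (the current view)
        self._snapshots = {}  # snapshot name -> frozen copy of the pointer table

    def write(self, name, data):
        # Never overwrite: allocate a fresh block and repoint the live table.
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = bytes(data)
        self._live[name] = block_id

    def delete(self, name):
        # Only the live pointer goes away; snapshots still reference the block.
        del self._live[name]

    def snapshot(self, snap_name):
        # Copies pointers only, never file data.
        self._snapshots[snap_name] = dict(self._live)

    def read(self, name, snapshot=None):
        table = self._live if snapshot is None else self._snapshots[snapshot]
        return self._blocks[table[name]]

    def block_count(self):
        return len(self._blocks)
```

Taking a snapshot allocates no new data blocks. Extra space is consumed only when something changes afterwards, and only for the parts that changed. That is also why a mistyped delete stops being a disaster: the old blocks are still there, reachable through the snapshot.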

Still not convinced?

Okay, here are three further high-level benefits of ZFS over HFS+:

  • Designed to support Unix. There are a lot of subtleties to supporting a modern Unix system. HFS+ was not really designed for that purpose. Yes, it's been hacked up to support Unix permissions, node locking, lazy zero-fill, symbolic links, hard links, NFS readdir semantics, and more. Some of these were easy. Others were painful and exhibit subtle bugs or performance problems to this day.

  • Designed to support modern filesystem concepts. Transactions. Write cache safety. Sparse files. Extended metadata attributes. I/O sorting and priority. Multiple prefetch streams. Compression. Encryption. And that's just off the top of my head.

    HFS+ is 10-year-old code built on a 20-year-old design. It's been extended to do some of this, and could in theory be extended to do some of the others... but not all of them. You'll just have to trust me that it's getting to the point where some of this stuff is just not worth the engineering cost of hacking it in. HFS+ is great, but it's getting old and creaky.

  • Actually used by someone besides Apple. Don't underestimate the value of a shared standard. If both Sun and Apple start using the same open-source filesystem, it creates a lot of momentum behind it. Having more OS clients means that you get a lot more eyes on the code, which improves the code via bug fixes and performance enhancements. This makes the code better, which makes ZFS more attractive to new clients, which means more eyes on the code, which means the code gets better... It's a virtuous circle.

Is ZFS the perfect filesystem? I doubt it. I'm sure it's got its limitations just like any other filesystem. In particular, wrapping a GUI around its administration options and coming up with good default parameters will be an interesting trick, and I look forward to seeing how Apple does it.

But really, seriously, dude. The kid is cool. Don't be like that.

Don't be a ZFS hater.

Updates:

MWJ's response: You don't have to be a ZFS hater to know it's wrong for you.
My followup: ZFS Hater Redux

Thursday, October 04, 2007

MarsEdit Markdown Scripts updated to 1.0.3  

No piece of software, however simple, is bug-free.

Daniel Jalkut was kind enough to point out to me that my Markdown scripts for MarsEdit didn't deal properly with Unicode text.

Silly me. I'd forgotten that AppleScript (with its very early-1990s roots) still needs to be explicitly told not to lose data when writing out files. He offered a fix — a few «class utf8» coercions in the right place and all was well again. That was 1.0.1.

However, immediately after release, I discovered that the scripts needed to do some more aggressive transcoding of UTF-8 into ASCII in order to get Python to read the file in html2text. So I've added support for that. This slows down the HTML-to-Markdown reverse conversion a little bit, but at least it's correct now. That was 1.0.2.

Finally, later in the evening I realized it was pretty stupid to write AppleScript code to transcode the UTF-8 into ASCII, because AppleScript's support for Unicode is so horribly primitive and I was doing it character by character. So I rewrote that whole section as a one-line Perl script. Now it's blazing fast. And that's 1.0.3.
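For the curious, here is roughly what that kind of transcoding looks like. This is a Python illustration rather than the Perl one-liner that ships in the scripts, and it assumes the goal is to replace every non-ASCII character with an HTML numeric character reference. The function name is made up for the example.

```python
def utf8_to_ascii_html(raw_bytes):
    """Decode UTF-8 bytes and return pure-ASCII text.

    Characters outside ASCII become numeric character references
    (for example, U+00E9 becomes "&#233;"), which HTML tools read back
    as the original characters.
    """
    text = raw_bytes.decode("utf-8")
    # 'xmlcharrefreplace' is a standard Python codec error handler.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

Whatever the language, the speedup comes from the same place: the conversion happens in one pass over the whole string instead of character by character in script code.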

I've updated the scripts. Download them now!