Recording Artist: When Benchmarks Attack

Tuesday, October 31, 2006

When Benchmarks Attack

In my professional life I've found myself measured by an external benchmark many times. One time that comes to mind was when I was writing disk drivers for Mac OS 8.

The disk driver's job in Mac OS 8 was simply to pass requests from the filesystem layer to the underlying storage medium. Requests would come in for such-and-such amount of data at such-and-such an offset from the start of the volume. After some very minimal translation, we'd pass this request onward to the ATA Manager, or USB Manager, or FireWire libraries.

(Sounds easy, right? Well, yes, but the devil was in the details. USB and FireWire supported "hot-plugging", i.e. attaching or removing the drive while the computer was still running. Supporting this in an OS that was never designed for it was a big chore. CD and DVD drives were also much more difficult to handle than regular hard drives, for similar reasons.)

But we ran into a problem as we were preparing our drivers for release. The hardware manufacturers that were buying our drivers wanted to make sure our drivers were "fast". So they ran disk benchmarks against our drivers. Not an unreasonable thing to do, you might say, although as I've stated there really wasn't a lot of code in the critical path. The problem was that most of the disk benchmarks turned out to be inappropriate measures of driver performance.

Measuring weight with a ruler

One common disk benchmark of the time was to read sequential 512-byte blocks from a large file. In Mac OS 8 these reads were passed directly down to the disk driver.
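For the curious, here is a rough sketch of what that kind of benchmark boils down to. It is not the actual benchmark tool (the post doesn't name one), and the function name is made up for illustration. It also assumes a modern OS, where the kernel's own cache and read-ahead sit underneath these reads, so it won't reproduce the Mac OS 8 behavior where each request went straight to the driver.

```python
import time

BLOCK_SIZE = 512  # request size used by the benchmark described above


def sequential_small_reads(path, block_size=BLOCK_SIZE):
    """Read a file front to back in block_size requests.

    Hypothetical helper. Returns (total_bytes, request_count, elapsed_seconds).
    """
    total = 0
    requests = 0
    start = time.perf_counter()
    # buffering=0 gives a raw file object, so Python adds no buffer of its
    # own and each read() becomes one request to the OS.
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(block_size)
            if not block:  # empty bytes means end of file
                break
            total += len(block)
            requests += 1
    return total, requests, time.perf_counter() - start
```

The number to watch is the request count: a 1 MiB file turns into 2,048 separate requests, and each one pays the full per-request overhead.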

Remember, the disk driver's job was supposed to be to simply pass the incoming requests on to the next layer in the OS. The vast majority of the time would be spent accessing the hardware. So no matter what the requests were, in effect this test should have returned almost exactly the same results for all drivers, with code differences accounting for less than 1% of the results. Right? Wrong.

As we ran the tests, we found out that on this particular benchmark, our drivers were much slower than existing third-party drivers (our competition, more or less). Dramatically slower, in fact -- if we ran the test in 100 seconds, they ran it in 10 seconds.

Upon further examination, we found out the reason why they did so well on this benchmark. They had already been optimized for it!

We discovered that the other drivers contained caching logic which would cluster these small 512-byte sequential reads into larger chunks of 32 KiB or so. Doing so reduced the number of round trips to the hardware, which improved their performance on this benchmark.
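To make the trick concrete, here is a minimal simulation of that kind of read clustering. This is my own sketch, not the competitors' code (which I never saw in this form): `FakeDisk`, `PassThroughDriver`, `ClusteringDriver` and `run_benchmark` are all hypothetical names, and the fake disk just counts round trips instead of touching hardware.

```python
CHUNK_SIZE = 32 * 1024  # the "32 KiB or so" cluster size from the text
BLOCK_SIZE = 512        # the benchmark's request size


class FakeDisk:
    """Hypothetical stand-in for the hardware; counts round trips."""

    def __init__(self, data):
        self.data = data
        self.round_trips = 0

    def read(self, offset, length):
        self.round_trips += 1
        return self.data[offset:offset + length]


class PassThroughDriver:
    """Forwards every request unchanged, as our driver originally did."""

    def __init__(self, disk):
        self.disk = disk

    def read(self, offset, length):
        return self.disk.read(offset, length)


class ClusteringDriver:
    """On a miss, reads a whole chunk and serves later small reads from it."""

    def __init__(self, disk, chunk_size=CHUNK_SIZE):
        self.disk = disk
        self.chunk_size = chunk_size
        self.cache_start = None
        self.cache = b""

    def read(self, offset, length):
        if length >= self.chunk_size:
            # Large requests gain nothing from the cache; pass them through.
            return self.disk.read(offset, length)
        hit = (self.cache_start is not None
               and self.cache_start <= offset
               and offset + length <= self.cache_start + len(self.cache))
        if not hit:
            # Read ahead: fetch a full chunk starting at the requested offset.
            self.cache_start = offset
            self.cache = self.disk.read(offset, self.chunk_size)
        start = offset - self.cache_start
        return self.cache[start:start + length]


def run_benchmark(driver, total_bytes, block_size=BLOCK_SIZE):
    """Issue sequential small reads and return the hardware round trips used."""
    for offset in range(0, total_bytes, block_size):
        driver.read(offset, block_size)
    return driver.disk.round_trips
```

Over a 64 KiB span the pass-through driver makes 128 round trips and the clustering driver makes 2. That ratio is where the benchmark win comes from, and it only appears when the requests are both small and sequential.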

Do what sells, not what's right

Now, it's important to understand that this tiny-sequential-read benchmark doesn't even reflect real-world use. The fact that larger I/O requests perform better than smaller I/O requests has been well known for decades, and almost every commercial application out there used large I/O requests for this very reason. Buffered I/O is built into the standard C libraries, for Pete's sake. In the real world, requests that came into the disk driver tended to be either large or non-sequential.
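Here's a small illustration of that last point, using Python's `io.BufferedReader` as a stand-in for C's buffered stdio. `CountingRaw` and `compare_request_counts` are hypothetical names I made up for the sketch, and the 32 KiB buffer size is just an assumption chosen to match the numbers above. The counting stream plays the role of "everything below the application".

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that counts the read requests reaching it."""

    def __init__(self, data):
        super().__init__()
        self._data = data
        self._pos = 0
        self.requests = 0

    def readable(self):
        return True

    def readinto(self, buf):
        self.requests += 1
        chunk = self._data[self._pos:self._pos + len(buf)]
        buf[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)  # 0 signals end of stream


def read_in_small_blocks(stream, block_size=512):
    """The application's view: lots of tiny sequential reads."""
    total = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break
        total += len(block)
    return total


def compare_request_counts(data, buffer_size=32 * 1024):
    """Return (unbuffered_requests, buffered_requests) for the same workload."""
    unbuffered = CountingRaw(data)
    read_in_small_blocks(unbuffered)

    raw = CountingRaw(data)
    read_in_small_blocks(io.BufferedReader(raw, buffer_size=buffer_size))
    return unbuffered.requests, raw.requests
```

With buffering in the library, the application can read 512 bytes at a time all day long and the layers underneath still see only a handful of large requests. A driver-level cache has nothing left to cluster.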

Modifying our drivers to do the same thing would basically be pointless -- a lot of work for very little performance difference.

(It only gets worse when you realize that our disk driver was directly underneath the Mac OS disk cache. A driver-level cache was almost completely redundant. And the right place to do this sort of readahead caching would have been there, in the existing OS cache layer, not in each individual disk driver.)

But ... sigh. The real world doesn't always make sense. The small-sequential-read test was a well-known benchmark, even if it was a disk benchmark and not a driver benchmark. And that simple fact elevated this atypical scenario in the eyes of a lot of people who didn't really understand what it meant. To them, if our driver didn't do the same thing as the competition, we wouldn't measure up.

De-optimizing to improve the benchmark

So in order to make sales we were forced to burn another month adding and testing driver-level caching so that we'd perform better on this benchmark.

The irony? The work we did to improve this atypical benchmark actually increased the amount of time we spent in the driver code, and sometimes increased the amount of data we read, which in turn decreased real-world performance by as much as 1-2%.
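A back-of-the-envelope model shows where the extra data comes from. This is only a sketch under simple assumptions: a single cached chunk, a 32 KiB chunk size, and a made-up function name `bytes_fetched`. It counts bytes transferred and nothing else, so it says nothing about the 1-2% figure, which came from real measurements.

```python
def bytes_fetched(offsets, request_size=512, chunk_size=32 * 1024):
    """Bytes pulled from the disk by a driver that reads a whole chunk on
    every cache miss and keeps only the most recent chunk (hypothetical model).
    """
    fetched = 0
    cache_start = None
    for offset in offsets:
        hit = (cache_start is not None
               and cache_start <= offset
               and offset + request_size <= cache_start + chunk_size)
        if not hit:
            cache_start = offset
            fetched += chunk_size
    return fetched


# 100 small reads, each 1 MiB apart: every one misses and drags in 32 KiB.
scattered = [i * 1024 * 1024 for i in range(100)]

# The same 100 small reads laid end to end: the benchmark's best case.
sequential = range(0, 100 * 512, 512)
```

Both workloads ask for 51,200 bytes. In this model the sequential one fetches 65,536 bytes from the disk, while the scattered one fetches 3,276,800, almost all of it thrown away.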

But hey, at least that benchmark got faster.

And we sold our driver.

3 comments:

Anonymous said...

FWB conditionally compiled large amounts of error handling code out of the binaries which were shipped to reviewers in order to boost their performance slightly. You've never seen so many #ifdefs.

7:31 PM, November 04, 2006

Drew Thaler said...

Heh, really!? That's just sick. But based on the other stuff I've heard about their business practices I would totally believe it. And to think, there was a time when you could have bought that steaming pile for a mere $65,000 on eBay.

btw, I'm guessing that must be nate dawg of KHI, famed for San Francisco Mac consulting, since you knew immediately which other drivers I was talking about. I dig your new site. :-)

3:20 PM, November 05, 2006

Anonymous said...

Ha, I missed that sale. I still have all the non-Brisbin source code and my NDA has lapsed; wonder if anyone wants to buy it.

Stuart was really an all around good guy and it was honestly probably the most fun real job I've ever had thanks to him, but the owner was just hell-bent on squeezing every last dollar out of a pile of code that never should have been sold to anyone in the post-clone era and "rewrite" was not in his vocabulary. CDT was built in Think-C all the way to the end, and while it was a very different world back in the pre-MMC days when drives were even less compatible than they are today, the 200-odd page nested state machine James "Mr. Software" Merkle constructed to cope with these differences baffles me to this day. Though to be honest, I'm not too sure .AppleCD was much better off from my horrifyingly awkward conversations with Sergio. Never saw the actual source, though.

HDT was marginally better since it did get at least one major overhaul, but it was still basically one guy's "Hello World" C++ project, so finding where anything was happening was a real chore. I cannot even imagine attempting to maintain that thing without Metrowerks' code browser turning everything into hyperlinks for you.

And yeah, site's not really new (though the preferred link would be to khiltd.com since I'll probably let that old domain expire once my DBA registration lapses and I get all my stationery switched over to the LLC name ;p), I'm just really slow putting things together since a decade's worth of sitting with horrible posture traumatized my pudendal nerve to the extent that sitting at all has been virtually impossible for a few years. I'd sue somebody for not giving me an Aeron if I hadn't telecommuted for so much of that time :)

Use your RSI timers, kids!

5:28 PM, November 05, 2006
