Wednesday, April 20, 2011

Optimization Goals  

In my day job, I've worked with many different teams of programmers. Game engine, game tools, OS, SDK, middleware, test engineers, you name it. Many of the people I work with are the best in the industry, true world-class programmers. And others, well, aren't... and that's okay too. :-)

Personally, I find that I've learned a ton from interacting with all of these different groups.

One of the most important lessons I've learned — and I'm frankly surprised this isn't more widely talked about — is that there are many different and equally valid possible optimization goals in programming.

When I talk to a new team, that's often the first thing I try to find out: what are your primary goals? What are you looking for?

The most common definition of optimization involves improving the runtime speed of the code. For example, game engine programmers are often very focused on this, because they have to cram as much work as they can into one frame — generally either 1/30th of a second (33.3ms) or 1/60th of a second (16.6ms). Naturally, this leads to a lot of counting cycles and downcoding inner loops into assembly.

However, game tool programmers tend to be a little more willing to sacrifice speed in favor of optimizing for other goals: development time, correctness, and ease-of-use. For example, it's often OK to use an unoptimized implementation of some algorithm in a tool, since a high-end PC will run it "fast enough". In other cases a tool might only run infrequently or overnight. Or it might be feasible to simply throw more hardware at the problem.

You also see it across studios: a studio making a AAA game might be very focused on optimizing runtime speed, while a small indie team might be focused on optimizing for development cost — doing the work as cheaply as possible. These teams will have very different optimization goals, to the point that what the AAA team has downcoded to assembly might be plain C++ objects and STL in the indie title.

Then there are the sports games — which live in a fascinating alternate world all their own. These guys have to iterate and ship a new version once a year, every single year! Slipping is simply not an option, so they tend to optimize first for reliability, upgradeability, and ease-of-maintenance.

This got me thinking: what are all the possible optimization goals you might have?

Possible Optimization Goals

Here are some things that you may choose to optimize for in your development. It's impossible to optimize for all of these things simultaneously, but it's common to target more than one.

  • runtime speed - Run as fast as possible on your target system. Most of the optimization literature focuses on this. It involves choosing algorithms carefully, choosing your data layout carefully, looking for early exit opportunities, hand-optimizing inner loops, and so on.

  • development time - Complete the engineering work as quickly as possible. In other words, finish the job in a day or a week, rather than a month. Optimizing for development time may mean that you use the highest-level language available that will do the job, along with whatever third-party frameworks you can pull in to solve the problems you need solved. C#, Python, Ruby, etc.

  • development cost - Complete the engineering work as cheaply as possible. This is closely related to development time, but slightly different! The primary difference is the choice of who is doing the work: it might be cheaper (and a better investment) to have an intern use a task as a learning project and work on it for three months, than to have a senior programmer crank it out in three weeks.

  • correctness - Be absolutely sure there are no mistakes or bugs. Bug-free code is always a goal, of course, but sometimes it's even more important than usual. There are two common situations I can think of where correctness is your primary goal: (1) when creating a reference implementation of an algorithm, for possible later optimization. (2) when you're making important architectural decisions based on the results of some investigation or other; you'll generally double- and triple-check the investigation, if you're smart.

  • ease-of-use - Make sure users can grasp your interface quickly and achieve what they need to do. Funnily enough, this applies to both applications (user interfaces) and to middleware (developer interfaces). In both cases, you need to understand how people will expect to use the product. The fastest or easiest way to do things might not be the most understandable.

  • ease-of-maintenance - Make the code so straightforward a monkey could maintain it. Or you, in a few years. ;-) If you've ever had to go back and bug-fix some code that you (or god help you, someone else) wrote more than five years ago, you'll appreciate the value of this. Code that is well-written, easy to read, well-commented, and shaped well to fit the problem is code that is optimized for ease of maintenance. To some extent, this goal can be a cause of "not-invented-here" syndrome; the preference for in-house code is often born from a perfectly rational and reasonable desire to know how to maintain that code.

  • reliability - Never fail. Keep on going in the face of problems. At runtime, if the program gets unexpected input, don't crash: flag it, log it, and continue on with a sensible fallback behavior. At development time, always have a reference implementation to fall back to if the optimized version isn't working. In the extreme, reliability might require a reduction (hopefully small) in average-case performance just to ensure that the code can reliably handle uncommon edge cases. (LibC's memset and memcpy are examples of that.) There's a small sketch of the runtime flavor after this list.

  • upgradeability - Make sure every subsystem can be upgraded. This is most commonly achieved with modularity. Good module separation should mean that individual pieces can be swapped out without too much trouble. If you remove modularity in order to run faster with today's situation, then you may be unable to upgrade to handle tomorrow's situation without a full rewrite.

  • iteration time - Minimize the time difference between when you make a change and when you are able to view and test the result. This covers a lot of ground and may include things like build time reduction, file loading, dynamic loading of updated assets, console boot time, and more. This is surprisingly important: the people who write the best software will almost always tell you that rapid iteration is king.
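
Most of these goals are cultural rather than mechanical, but the reliability bullet has a concrete shape. Here's a minimal sketch of "flag it, log it, and continue"; the helper name, the 0-100 range, and the default of 50 are all invented for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    // Sketch: unexpected input earns a logged warning and a sensible
    // fallback instead of a crash. Names and ranges are hypothetical.
    int ReadVolumeSetting(const char *text)
    {
        char *end = NULL;
        long value = text ? strtol(text, &end, 10) : 0;
        if (text == NULL || end == text || value < 0 || value > 100) {
            fprintf(stderr, "warning: bad volume setting, using 50\n");
            return 50;   // sensible fallback; keep on going
        }
        return (int)value;
    }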

Did I miss any? What are you optimizing for?

Tuesday, March 16, 2010

Variants of CompareAndSwap  

I ran across an interesting issue the other day while trying to make a cross-platform atomics header work on Mac OS X.

Compare-and-Swap (hereafter, CAS) is one of the fundamental atomic primitives: given a working CAS you can implement any other atomic operation.

But there are in fact two common variants of CAS:

    // Returns the old value from *ptr
    int ValueCAS(volatile int *ptr, int compare, int swap);

    // Returns true if a swap occurred
    bool BoolCAS(volatile int *ptr, int compare, int swap);

That's OK, though, because it's theoretically possible to convert either one into the other.

It's trivial to implement BoolCAS in terms of ValueCAS. Here's how you do it:

    // GOOD: this works correctly.
    bool BoolCAS(volatile int *ptr, int compare, int swap) {
        return compare == ValueCAS(ptr,compare,swap);
    }

But for what I was doing, I needed to go the other way around. Mac OS X provides BoolCAS functions in <libkern/OSAtomic.h>, and I needed to convert those to ValueCAS functions.
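
(For concreteness: the OS X primitive maps onto the BoolCAS shape above roughly like this. The wrapper is my sketch, not code from the header; note that OSAtomicCompareAndSwap32Barrier() puts the pointer last.)

    #include <libkern/OSAtomic.h>

    // Sketch: the OS X primitive already has BoolCAS semantics; it
    // returns true if and only if the swap occurred.
    bool BoolCAS(volatile int *ptr, int compare, int swap) {
        return OSAtomicCompareAndSwap32Barrier(compare, swap,
                                               (volatile int32_t *)ptr);
    }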

It turns out to be subtly difficult to go the other way around, and I thought I'd share the reasons why.

First attempt

My first attempt was fairly simple, using just an extra load and branch:

    // BAD: do not use! Broken!
    int ValueCAS(volatile int *ptr, int compare, int swap) {
        if (BoolCAS(ptr,compare,swap))
            return compare;
        else
            return *ptr;
    }

The idea is that if the CAS worked, then the old value of *ptr must have been the same as compare, so we can just return that. That's absolutely correct.

However, if the CAS failed, what was the previous value? I was naïvely re-reading *ptr. But my unit tests (oh yes, this is the kind of thing you unit-test thoroughly) showed that very rarely — once per several million iterations — it would fail to do the right thing.

What was going wrong?

The problem comes down to how callers detect success and failure from ValueCAS. If ValueCAS returns compare, the CAS is assumed to have succeeded. If ValueCAS returns any other value, the CAS is assumed to have failed.

Once we recognize this, we can see that the naïve implementation above has not one, but two failure cases:

  1. Spurious CAS failure. This can happen on nearly any system, for a variety of reasons. I'd rather not get too deep into explanations of CPU behavior here, but the typical example might be if an interrupt occurred at exactly the wrong moment. Even if *ptr == compare when the CAS runs, a spurious failure might exit the CAS without changing the value of *ptr. This would cause ValueCAS to incorrectly report success!
  2. Race between CAS and reload. If more than one thread is atomically modifying *ptr, then there's a window between the CAS and the reload where *ptr could be changed to anything at all — including getting set back to compare. Again, this would cause ValueCAS to incorrectly report success!

Another approach

One tempting way to fix it is to simply lie about the old value of *ptr. I didn't actually try this, but I'll admit I considered it for a moment:

    // BAD: do not use! Semantically incorrect!
    int ValueCAS(volatile int *ptr, int compare, int swap) {
        if (BoolCAS(ptr,compare,swap))
            return compare;
        else
            return (compare-1); // or any value other than compare
    }

But this is wrong! It breaks the semantics of the CAS. CAS is supposed to return the old value. But the value this function returns is totally bogus, and may never have existed at *ptr. To help imagine why this is bad, consider what might happen if the caller is using the value as a bitmask. We might wind up returning an illegal mask with bits that had never been set. This could have unintended, ugly, subtle side-effects.

Third time's the charm

As far as I can tell, the best way to fix it is by adding both an extra load and two branches:

    // GOOD: works correctly
    int ValueCAS(volatile int *ptr, int compare, int swap) {
        int old;
        do {
            if (BoolCAS(ptr,compare,swap))
                return compare;
            old = *ptr;
        } while (old == compare); // never return compare if CAS failed
        return old;
    }

And sure enough, it's a little heavy-handed, but it works. Proper branch hinting will help, but I'm still a bit surprised that it's so expensive (relatively speaking) to do this conversion.
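
To close the loop on the claim up top, that any atomic operation can be built from a working CAS, here's a usage sketch (AtomicAdd is a name I made up):

    // Sketch: atomic add on top of ValueCAS. A return value equal to
    // 'old' means our swap took effect.
    int AtomicAdd(volatile int *ptr, int amount) {
        int old = *ptr;
        for (;;) {
            int seen = ValueCAS(ptr, old, old + amount);
            if (seen == old)
                return old + amount;   // success: return the new value
            old = seen;                // lost a race; retry with what we saw
        }
    }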

Can anyone come up with a better way?

Sunday, December 02, 2007

The Case Against Insensitivity  

One of the most controversial parts of my earlier post, Don't Be a ZFS Hater, was when I mentioned off-handedly in the comments that I don't like case-insensitivity in filesystems.

Boy, did that spur a storm of replies.

I resolved to not pollute the ZFS discussion with a discussion of case-insensitivity and promised to make a separate blog post about it. It took a while, but this is that post. I blame a busy work schedule and an even busier travel schedule. (Recently in the span of two weeks I was in California, Ohio, London, Liverpool, London, Bristol, London, Amsterdam, London, then back to Ohio. Phew!)

Here's Why Case-Insensitive Filesystems Are Bad

I've worked in and around filesystems for most of my career; if not in the filesystem itself then usually a layer just above or below it. I'm speaking from experience when I tell you:

Case-insensitivity is a bad idea in filesystems.

And here's why:

  1. It's poorly defined.
  2. Every filesystem does it differently.
  3. Case-insensitivity is a layering violation.
  4. Case-insensitivity forces layering violations upon other code.
  5. Case-insensitivity is contagious.
  6. Case-insensitivity adds complexity and provides no actual benefit.

I'll expand on each of these below.

It's poorly defined

When I say "case-insensitive", what does that mean to you?

If you only speak one language and that language is English, it probably seems perfectly reasonable: map the letter a to A, b to B, and so on through z to Z. There, you're done. What was so hard about that?

But that's ASCII thinking; the world left that behind a long time ago. Modern systems are expected to deal with case differences in all sorts of languages. Instead of a simple 26-letter transformation, "case insensitivity" really means handling all the other alphabets too.

The problem with doing that, however, is that it brings language and orthography into the picture. And human languages are inherently vague, large, messy, and constantly evolving.

Can you make a strict definition of "case insensitivity" without any hand-waving?

One way to do it is with an equivalence table: start listing all the characters that are equal to other characters. We can go through all the variants of Latin alphabets, including a huge list of accents: acute, grave, circumflex, umlaut, tilde, cedilla, macron, breve, dot, ring, ogonek, hacek, and bar. Don't forget to find all the special ligatures and other letters, too, such as Æ vs æ and Ø vs ø.

Okay, our table is pretty big so far. Now let's start adding in other alphabets with case: Greek, Armenian, and the Cyrillic alphabets. And don't forget the more obscure ones, like Coptic. Phew. It's getting pretty big.

Did we miss any? Well, for any given version of the Unicode standard it's always possible to enumerate all letters, so it's certainly possible to do all the legwork and prove that we've got all the case mappings for, say, Unicode 5.0.0 which is the latest at the time of this writing. But Unicode is an evolving standard and new characters are added frequently. Every time a new script with case is added we'll need to update our table.

There are also some other hard questions for case insensitivity:

  • Digraph characters may have three equivalent mappings, depending on how they are being written: all-lowercase, all-uppercase, or title-case. (For example: dz, DZ, or Dz.) But this breaks some case-mapping tables which didn't anticipate the need for an N-way equivalence.

  • The German letter ß is considered equal to lowercase ss. Should "Straße" and "STRASSE" be considered equivalent? They are in German. But this breaks some case-mapping tables which didn't anticipate the need for an N-to-M character translation (1:2, in this case). (See the sketch just after this list.)

  • Capital letters can significantly alter the meaning of a word or phrase. In German, capital letters indicate nouns, so the word Essen means "food", while the word essen means "to eat". We make similar distinctions in English between proper nouns and regular nouns: God vs god, China vs china, Turkey vs turkey, and so on. Should "essen" and "Essen", or "china" and "China" really be considered equivalent?

  • Some Hebrew letters use different forms when at the end of a word, such as פ vs ף, or נ vs ן. Are these equivalent?

  • In Georgian, people recently experimented with using an obsolete alphabet called Asomtavruli to reintroduce capital letters to the written language. What if this had caught on?

  • What about any future characters which are not present in the current version of the Unicode standard?
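
To make that ß bullet concrete, here's a minimal sketch (mine, illustrative only) of the one-character-at-a-time folding that old case-mapping tables were built around. It simply has no way to express a 1:2 mapping:

    #include <cctype>
    #include <cstddef>
    #include <string>

    // Illustration only: "ASCII thinking" case comparison. Folding one
    // byte at a time can never express ß -> ss, and byte-wise tolower()
    // mangles multi-byte UTF-8 sequences anyway.
    bool NaiveCaseEqual(const std::string &a, const std::string &b)
    {
        if (a.size() != b.size())
            return false;
        for (size_t i = 0; i < a.size(); ++i)
            if (std::tolower((unsigned char)a[i]) !=
                std::tolower((unsigned char)b[i]))
                return false;
        return true;
    }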

Case is a concept that is built into written languages. And human language is inherently messy. This means that case-insensitivity is always going to be poorly defined, no matter how hard we try.

Every filesystem does it differently

Unfortunately, filesystems can't engage in hand-waving. Filesystem data must be persistent and forward-compatible. People expect that the data they wrote to a disk last year should still be readable this year, even if they've had an operating system upgrade.

That's a perfectly reasonable expectation. But it means that the on-disk filesystem specification needs to freeze and stop changing when it's released to the world.

Because our notion of what exactly "case-insensitive" means has changed over the past twenty years, however, we've seen a number of different methods of case-insensitivity emerge.

Here are a handful of the most popular case-insensitive filesystems and how they handle case-mapping:

  • FAT-32: ASCII upper- and lower-case letters, but a-z and A-Z are considered identical. Also variable IBM code pages in high ASCII.
  • HFS: ASCII upper- and lower-case letters, but a-z and A-Z are considered identical. Also variable Mac encodings in high ASCII.
  • NTFS: Case-insensitive in different ways depending on the version of Windows that created the volume.
  • HFS+: Case-insensitive with a mapping table which was frozen circa 1996, and thus lacks case mappings for any newer characters.

None of these — except for NTFS created by Vista — are actually up-to-date with the current Unicode specification. That's because they all predate it. Similarly, if a new filesystem were to introduce case-insensitivity today, it would be locked into, say, Unicode 5.0.0's case mappings. And that would be all well and good until Unicode 5.1.0 came along.

The history of filesystems is littered with broken historical case mappings like a trail of tears.

Case-insensitivity is a layering violation

When people argue for case-insensitivity in the filesystem, they almost always give user interface reasons for it. (The only other arguments I've seen are based on contagion, which I'll talk about in a moment.) Here is the canonical example:

My Aunt Tillie doesn't know the difference between letter.txt and Letter.txt. The filesystem should help her out.

But in fact this is a UI problem. The problem relates to the display and management of information, not the storage of this information.

Don't believe me?

  • When any application displays items in a window, who sorts them case-insensitively? The filesystem? No! The application does it.

  • When you type-select, typing b-a-b-y to select the folder "Baby Pictures" in an application, who does the case-insensitive mapping of the letters you type to the files you select? The filesystem? No! The application again.

  • When you save or copy files, who does the case-insensitive test to warn you if you're creating "file.txt" when "File.txt" already exists? The filesystem? Yes!

Why does the third question have a different answer than the rest?

And we've already talked about how filesystems are chronically out-of-date with their case mappings. If your aunt is a Turkish Mac user, for example, she's probably going to notice that the behavior of the third one is different for no good reason. Why are you confusing your Aunt Tülay?

One last point was summarized nicely by Mike Ash in the comments of Don't Be a ZFS Hater. I'll just quote him wholesale here:

Yes, Aunt Tillie will think that "Muffin Recipe.rtf" and "muffin recipe.rtf" ought to be the same file. But you know what? She'll also think that "Muffin Recipe .rtf" and "Recipe for Muffins.rtf" and "Mufin Recipe.txt" ought to be the same file too.

Users already don't generally understand how the OS decides whether two files are the same or not. Trying to alleviate this problem by mapping names with different case to the same file solves only 1% of the problem and just isn't worth the effort.

I agree completely.

Case-insensitivity forces layering violations upon other code

All too often, pieces of code around the system are required to hard-code knowledge about case-insensitive filesystem behavior. Here are a few examples off the top of my head:

  • Collision prediction. An application may need to know if two files would conflict before it actually writes either of them to disk. If you are writing an application where a user creates a group of documents — a web page editor, perhaps — you may need to know when banana.jpg and BANANA.JPG will conflict.

    The most common way that programmers solve this is by hard-coding some knowledge about the case-insensitivity of the filesystem in their code. That's a classic layering violation.

  • Filename hashing. If you are writing code to hash strings that are filenames, you probably want equivalent paths to generate the same hash. But it's impossible to know which files are equivalent unless you know the filesystem's rules for case-mapping.

    Again, the most common solution is a layering violation. You either hard-code some knowledge about the case-insensitivity tables, or you hard-code some knowledge about your input data. (For example, you may just require that you'll never, never, ever have multiple access paths for the same file in your input data. Like all layering violations, that might work wonderfully for a while ... right up until the day that it fails miserably.)
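
As an illustration of that second bullet, here's a hedged sketch (hypothetical code, not from any real project) of a filename hash with ASCII-only folding baked in:

    #include <cstddef>
    #include <string>

    // The layering violation in miniature: this hash bakes in its own
    // guess at the filesystem's case rules (plain ASCII). The moment
    // the real mapping table disagrees, equivalent paths hash apart.
    std::size_t HashFilename(const std::string &name)
    {
        std::size_t h = 5381;
        for (unsigned char c : name) {
            if (c >= 'A' && c <= 'Z')
                c += 'a' - 'A';      // hard-coded case knowledge
            h = h * 33 + c;
        }
        return h;
    }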

I'm sure there are more examples out there.

Case-insensitivity is contagious

This is the worst part. It's all too easy to accidentally introduce a dependence on case-insensitivity: just use an incorrect path with bad case.

The moment somebody creates an application or other system that inadvertently depends on case-insensitivity, it forces people to use a case-insensitive filesystem if they want to use that app or system. And that's one of the major reasons why case-insensitivity has stuck around — because it's historically been very difficult to get rid of.

I've seen this happen with:

  • Source code. Some bozo writes #include "utils.h" when the file is named Utils.h. Sounds innocent enough, until you find that it's repeated dozens of times across hundreds of files. Now that project can only ever be compiled on a case-insensitive filesystem.

  • Game assets. A game tries to load lipsync.dat instead of LIPSYNC.DAT. Without knowing it, the artist or developer has accidentally locked that game so that it can only run on a case-insensitive filesystem. (This causes real, constant problems in game pipelines; teams create and test their games on case-insensitive NTFS and don't notice such problems until it's burned to a case-sensitive UDF filesystem on DVD or Blu-Ray.)

  • Application libraries. DLLs and shared library references are sometimes generated by a build script which uses the wrong case. When that happens, the application may simply fail to launch from a case-sensitive filesystem.

  • Miscellaneous data files. Sometimes an application will appear to run on a case-sensitive filesystem but some feature will fail to work because it fails to load a critical data file: the spell-checking dictionary, a required font, a nib, you name it.

Happily, since Mac OS X shipped in 2001, Apple has been busy solving its own problems with case-insensitivity and encouraging its developers to test with case-sensitive filesystems. Two important initiatives in this direction have been NFS home directories and case-sensitive HFSX.

The upshot of it is that Mac OS X is actually very friendly to case-sensitive disks these days; very little that's bad happens when you use case-sensitive HFSX today.

Case-insensitivity adds complexity with no actual benefit

I'm going to make an assertion here:

ONE HUNDRED PERCENT of the path lookups happening on your Mac right now are made with correct case.

Think about that for a moment.

First off, you may think this contradicts the point I just made in the previous section. Nope; I'm simply rounding. The actual figure is something like 99.999%, and I'd probably get tired of typing 9's before I actually approached the real number. There are infinitesimally few path accesses made with incorrect case compared to the ones that are made with the proper case.

Modern computers make hundreds of filesystem accesses per second. As I type this single sentence in MarsEdit on Mac OS X 10.4.11, my computer has made 3692 filesystem accesses by path. (Yes, really. MarsEdit's "Preview" window is invoking Perl to run Markdown, which loads a handful of modules, and then WebKit re-renders the page. That's a lot of it, but meanwhile there's background activity from Mail, Activity Monitor, iChat, SystemUIServer, iCalAlarmScheduler, AirPort Base Station Agent, Radioshift, NetNewsWire, Twitterrific, and Safari.)

Under Mac OS X you can measure it yourself with this command in Terminal:

  sudo fs_usage -f filesys | grep / > /tmp/accesses.txt

The vast majority of file accesses are made with paths that were returned from the filesystem itself: some bit of code read the contents of a directory, and passed the results on to another bit of code, which eventually decided to access one of those files. So most of the time the filesystem is getting back the paths that it has returned earlier. Very very few accesses are made with paths that come directly from an error-prone human, which is why essentially 100% of filesystem accesses are made with correct case.

But if essentially all filesystem accesses are made with the correct case to begin with, why do we even have case-insensitivity at all?

We've already discussed the problems of contagion, which is a circular justification: we have to do it because someone else did it first. We've also discussed UI decisions being incorrectly implemented in the bottommost layer of the operating system. Other than those two, what good is it?

I don't have an answer to that. For the life of me I can't come up with any reason to justify case-insensitive filesystems from a pure design standpoint. That leads me to my closing argument, which is...

A thought experiment

Suppose case-insensitive filesystems had never been invented. You're the leader of a team of engineers in charge of XYZZYFS, the next big thing in filesystems. One day you tell the other people who work on it:

"Hey! I've got this great idea! It's called case-insensitivity. We'll take every path that comes into the filesystem and compare it against a huge table to create a case-folded version of the path which we'll use for comparisons and sorting. This will add a bunch of complexity to the code, slow down all path lookups, increase our RAM footprint, make it more difficult for users of our filesystem to handle paths, and create a compatibility nightmare for future versions if we ever decide to change the table. But, you see, it'll all be worth it, because... _________________."

Can you fill in the blank?

Tuesday, September 04, 2007

Microbenchmarking  

I want to talk a little about a technique that can be very useful. Like all programming tools, however, it's a double-edged sword and can easily be abused.

I'm talking about microbenchmarks. A microbenchmark is a form of benchmark. Let's quickly define our terms:

Benchmark

noun: a relative measure of the performance of an operation (or set of operations) in computer software. Many factors may be used to create this measure, such as memory use or I/O, but when unspecified the measure is usually speed.

verb: to create or run a benchmark.

Microbenchmarks

Sometimes the term "benchmark" is too broad. I want to specifically talk about single benchmarks that are limited in scope. One commonly used term for this is "microbenchmark".

Microbenchmark

noun: a benchmark that measures the performance of a single operation which is a very small portion of a larger system.

verb: to create or run a microbenchmark.

I would consider each of the following to be an example of a microbenchmark:

  • Implementation comparisons. Suppose you are choosing between multiple implementations of the C standard library routine strcmp(). You might use a microbenchmark to compare different implementations against each other until you find the one that's right for you.

  • Equivalent function comparison. Sometimes there's more than one way to achieve your goal. If you want to know which is faster, you might use a microbenchmark to compare the different functions.

  • Inter-language primitive comparisons. Microbenchmarks are often used to compare programming languages. For example, Java, C++, and C# are often microbenchmarked to see which has the faster string comparison, stack allocation, heap allocation, and so on.

  • Individual aspects of a large system's performance. Examples might include adding a row to a database table, sorting a large list of input files, or even application launch time. These aspects may be measured in isolation (i.e., without a strict performance target) with the presumption that speeding them up is generally useful, since the end user's experience is composed of lots of these small operations.

I think the last example above is the most generic description of a microbenchmark, because it encompasses the others. When you are benchmarking strcmp(), you are really measuring an individual aspect of the C library's performance. And when you are comparing language primitives, you are again measuring an individual aspect of the language's performance.

The prefix "micro-" is relative, of course. If your job description is "Chief Programmer: strcmp()", then measuring the performance of strcmp() is simply benchmarking, not microbenchmarking. But if you are working on a large modern GUI application for a consumer OS, measuring the performance of strcmp() becomes a microbenchmark because it's such a small part of your overall system.

Losing Track of the Macro

Before I start talking about how to do microbenchmarks, there's an important point to be made. Improperly applied microbenchmarks can actually be hazardous to real-world performance.

Benchmarking goes hand-in-hand with optimization when it's done correctly. You benchmark first, and then optimize. Similarly, microbenchmarking goes hand-in-hand with microoptimization. And microoptimization is rarely a good idea.

That's not to say that microbenchmarks are bad. They can be really useful when applied correctly. I remember the competitions at Apple to create the fastest BlockMove for PowerPC, which were often repeated with each new generation of processor. Since it was one of the most heavily-used system calls in the Mac OS, each incremental improvement made the OS faster. Many individual benchmarks, such as drawing and text manipulation, got faster whenever BlockMove was improved.

The danger comes not from the microbenchmarking itself. The real problem occurs when you focus so much on the micro that you forget about the big picture. There are three main negative outcomes possible:

Dangers of Microbenchmarking and Microoptimization

  • Wasting development time. Newbie programmers sometimes burn weeks optimizing something that doesn't really need optimizing. This doesn't matter much when you're doing it as a hobby, but it matters a lot when you're in a real job.

  • Making something else slower. Sometimes performance can be like a carpet. Push it down over here, and it pops up over there. If the operation that you're microbenchmarking does not crop up often in real-world scenarios, you may find that all your micro-optimizing actually decreases your real-world performance.

  • Introducing bugs. Sloppy optimization can be worse than no optimization. Your code might work today, but the trick you used might not be future-proof.

Performance is Chunky

There are two important lessons that I've learned about software performance over the years:

Lesson 1: Performance is "chunky". Not all parts of the code contribute equally to the software's speed. A small piece of code might turn out to be a very big chunk of the performance, and a big piece of code might be almost immaterial. You need to know what your chunks look like before you can optimize intelligently.

Lesson 2: Design trumps everything. Choosing the right design or algorithm for your code is far more important than almost any microoptimization you can make. If your code calls strcmp() 4 million times in a loop, you're better off changing your design to call it less often. There's no reason to waste time looking for a faster strcmp().

From these two lessons we can pretty easily derive the types of optimizations that work best. Here they are, from most effective to least effective:

Ordered Rules for Optimization

Want to make your code faster? Follow these rules in order.

  1. Design for efficiency. Understand the real task. Eliminate redundancy. Understand how your algorithm is performing, and the ways it might not be optimal. Redesign the whole thing until it is optimal.

  2. Optimize areas where your code spends a lot of time. Sometimes there is an inner loop which is repeated millions of times. The first answer is always rule #1: "don't do that" — try to redesign your code to reduce the number of times you perform that operation. But if it truly can't be avoided, then go ahead and expend some effort on speeding it up.

  3. Microoptimization. Pick a single operation and make it faster in isolation. This is almost always the wrong thing to do.

Why are microbenchmarking and microoptimization almost always wrong? Because you should be taking care of #1 or #2 first. And the fundamental essence of programming is that the first two are never complete.

When should you microbenchmark?

Microbenchmark when you are writing part of an operating system, a virtual machine, or the runtime for a programming language. Compiler writers may also microbenchmark the output of their compiler.

Very occasionally, it might be useful to do it as part of the first two optimization tasks. Perhaps you're choosing between different algorithms, one which involves calling functionA() a thousand times, and one which involves calling functionB() a thousand times. Rather than writing two complete implementations and comparing them, it might be instructive to do a quick microbenchmark to see which of the two is faster in isolation.

You can also do it for your own amusement and education, or for a blog entry. While not terribly useful, it can be informative and teach you a thing or two about your OS.

How to write a microbenchmark

With all that said, let's get down to the details. Suppose that you eventually find yourself in a position where you really do need to run a microbenchmark.

I use a ten-step process:

  1. Consider all the factors that may impact performance. There may be non-obvious considerations like memory alignment, cache characteristics, or data set size. Decide exactly how you want to test your operation with those factors in mind.

  2. Write a loop to perform the operation a large number of times. Any single measurement might be way off, because your computer is busy doing all sorts of things like processing interrupts in the background that are out of your control. So you need to average over a longish interval. Try starting with about a hundred thousand iterations, then scale up by factors of ten until you wind up with a total runtime of a second or two.

  3. Use a high-resolution timer to record the start and stop time. Your data is only as good as your timer. Find the best one available. On Mac OS X that's mach_absolute_time(), and on Windows it's QueryPerformanceCounter(). Other platforms usually have an equivalent high-resolution timer. Call it only twice: once before the big loop, and once after.

  4. Write code to compute the result. The approximate time it takes for a single operation will be elapsedTime / numberOfIterations. The elapsedTime is generally (stopTime - startTime) / timerFrequency. Watch out for integer overflow and floating-point precision loss: it's easiest to just use double-precision math.

  5. Consider whether you should compile your code at the highest optimization level available. Compilers generate significantly slower code when optimization is off. You're almost certainly interested in the performance of the fully optimized build rather than the unoptimized debug build, but optimization can introduce its own difficulties. It depends on what exactly you're benchmarking.

  6. Compile, then disassemble your test to make sure it's really doing what you think it's doing. This step is non-negotiable!! If you can't read assembly, you shouldn't be doing microbenchmarking. Compilers can be tricky and there's no guarantee that it's actually doing what you think it's doing until you look. If you are testing strcmp(), for example, you might find that the compiler has decided to optimize out all ten million calls in your inner loop because you're using a string constant instead of a variable. Or perhaps your ten million iterations of input = input + 1 are being optimized out to input = input + 10000000. If this is happening to you, you'll need to figure out a way to force the compiler to do what you want. With gcc, for example, you might need to link functions in via separate modules to prevent inlining, or specify additional flags on the command line.

  7. Quiesce your system. You don't have to shut everything down, but do make sure that you're not running under abnormally heavy load. Most systems these days have a CPU meter which can tell you what's going on. If you're writing code on an embedded system, make sure that any asynchronous startup operations (like initializing the network stack) have finished before you begin.

  8. Run the test and get the results. If it runs too quickly or too slowly, you might need to adjust your iteration count and try again. If you get what looks like a good result, there's one more step:

  9. Run the test several more times and make sure your result is reproducible. This will also give you an idea of the typical error in your measurement.

  10. If you support different platforms or processors, run the test on each one. Sometimes constructs which work well with one processor are the worst-case for another. Occasionally operating systems will change the implementation of their standard library functions enough to perturb performance. Don't assume that just because it's faster on one configuration that it will always be faster everywhere.

Code Sample

Here's a piece of sample code I wrote to microbenchmark the speed of a function call on Mac OS X. I've removed the function body in order to simply show the overall technique.

/* foo.cpp */
#include <stdio.h>
#include <mach/mach_time.h>

int testfunc(int arg1, int arg2);

int main(int, const char **)
{
    struct mach_timebase_info  tb;
    uint64_t   start, stop, elapsed;
    double     seconds;
    const int  iterations = 300000000;
    int result = 0;

    mach_timebase_info(&tb);

    // Run the test.
    start = mach_absolute_time();
    for (int i=0; i<iterations; ++i)
        result = testfunc(result,i);
    stop = mach_absolute_time();
    elapsed = stop - start;

    // Print results.
    seconds = ((double)elapsed * tb.numer / tb.denom) / (1000*1000*1000);
    printf("%d iterations = %0.9lf seconds\n", iterations, seconds);
    printf("1 iteration = %0.9lf seconds\n", seconds/iterations);
    return 0;
}
/* bar.cpp */
int testfunc(int, int)
{
    return 0;
}

Notice how the function that it's calling is in a completely different source file? That's because on my first attempt gcc's inlining optimizer actually completely eliminated the function call. This was only visible upon inspecting the disassembly (step 6). Putting the target function in another module was the easiest way to stop the function from being inlined. (Yes, I tried -fno-inline and __attribute__((noinline)). Neither has the desired effect.)
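
For reference, building the two translation units separately might look like this (assuming gcc, and no whole-program optimization that could re-inline the call across the module boundary):

  g++ -O2 foo.cpp bar.cpp -o microbench
  ./microbench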

I've chosen to ignore the overhead of the loop itself in this measurement. It's essentially negligible for most types of microbenchmarks. Furthermore, as long as you wind up comparing the result against another test run with identical loop overhead, it will simply cancel from both sides of the equation.
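
Since step 3 mentions QueryPerformanceCounter(), here's a hedged sketch of the same harness on Windows. This is my port under the same foo.cpp/bar.cpp split, not code from the original test:

/* winbench.cpp -- Windows variant sketch */
#include <stdio.h>
#include <windows.h>

int testfunc(int arg1, int arg2);   // still lives in bar.cpp

int main()
{
    LARGE_INTEGER freq, start, stop;
    const int iterations = 300000000;
    int result = 0;

    // QueryPerformanceFrequency gives ticks-per-second, so no
    // timebase struct is needed.
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    for (int i = 0; i < iterations; ++i)
        result = testfunc(result, i);
    QueryPerformanceCounter(&stop);

    double seconds = (double)(stop.QuadPart - start.QuadPart)
                   / (double)freq.QuadPart;
    printf("%d iterations = %0.9lf seconds\n", iterations, seconds);
    printf("1 iteration = %0.9lf seconds\n", seconds / iterations);
    return 0;
}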

So that's it. Anyone have any interesting situations where you've found that you needed to run a microbenchmark? Does your technique differ from mine? Found any unexpected results? Let me know in the comments.

Sunday, August 19, 2007

BASIC for the iPhone  

This year C4[1] played host to Iron Coder Live. The API was "iPhone" and the theme was "conspiracy". The winner was well-deserved. But I had fun doing a hack of my own.

There was just one problem: I don't have an iPhone.

As I was driving from Cleveland to Chicago, I had a lot of time—about five hours—to think about what I might do for the contest. Without an actual iPhone it had to be something that could be largely developed offline and only run on the phone at the contest. That meant that I would need to stick to the, ahem, "official API" of the iPhone: HTML and JavaScript.

Although the phone is cool, I think it's even cooler that a whole community has built up to create the SDK that Apple failed to provide. After talking with some old friends at the conference I decided that it would be fun to join the conspiracy and create my own SDK for the iPhone, offering the ultimate retro-cool language: BASIC for the iPhone.

The version I showed at the conference garnered a very respectable 5th place, which I'm more than satisfied with considering its far more polished competition. I've since cleaned it up, GPLed it, and rechristened it. And here it is:

ippleSoft BASIC

The interface has been scaled for the iPhone, but it works just fine in Safari and Firefox.

Notes

Naming: I tried to pick a name that wouldn't infringe on anyone's existing trademark. That's harder than you might think: "iPhone BASIC" would surely lead to a cease-and-desist letter from the hounds. iBASIC, IP-BASIC, etc were all taken. Then I ran into a stroke of luck: It turns out "AppleSoft" is no longer a registered trademark of Apple. They let it expire in 2001. Thus: ippleSoft.

Entirely JavaScript, entirely free. I've made it available under GPL v2. There's no public repository yet, but hey, it's GPL, so you can go and create one yourself. I've released code into the public domain before, but this is the first project I've ever released under GPL.

I'm not a JavaScript programmer. I learned about the language and tried to follow best practice, though. Well, as much as I could during a single beer-soaked weekend.

Autoscroll: The output display will autoscroll in Mac Safari, but not iPhone Mobile Safari. You can flick the output display to scroll it, though. If anyone with an iPhone has a suggestion on how to make it autoscroll properly please let me know!

Language: For now I've kept it pretty close to a strict subset of AppleSoft BASIC. There's a lot it doesn't do yet: graphics and text positioning are not supported, and neither are INPUT or GET. Those are pretty high on the list of priorities though.

One final thought

You know what? Text-based interfaces suck on the iPhone. There's just no getting around this. It's not the output so much as the input. The iPhone feels great when you're pushing buttons and flicking the display to scroll. But the keyboard is really lame.

I'm not sure if that's a feature or a bug for ippleSoft BASIC. Personally I find it kind of perversely amusing to use the iPhone to run a language originally designed for teletypes. But it would be interesting to explore ways to take the text out of BASIC.

Loading and running programs from elsewhere on the web would be a natural. A keyword-board like the one on the Timex-Sinclair TS1000 might be useful if you really must write code on the iPhone. INPUT and GET may need to be supplemented with some type of dialog interface. Anything else?

What do you think?

Tuesday, October 31, 2006

When Benchmarks Attack  

In my professional life I've found myself measured by an external benchmark many times. One time that comes to mind was when I was writing disk drivers for Mac OS 8.

The disk driver's job in Mac OS 8 was simply to pass requests from the filesystem layer to the underlying storage medium. Requests would come in for such-and-such amount of data at such-and-such an offset from the start of the volume. After some very minimal translation, we'd pass this request onward to the ATA Manager, or USB Manager, or FireWire libraries.

(Sounds easy, right? Well, yes, but the devil was in the details. USB and FireWire supported "hot-plugging", i.e. attaching or removing the drive while the computer was still running. Supporting this in an OS that was never designed for it was a big chore. CD and DVD drives were also much more difficult to handle than regular hard drives, for similar reasons.)

But we ran into a problem as we were preparing our drivers for release. The hardware manufacturers that were buying our drivers wanted to make sure our drivers were "fast". So they ran disk benchmarks against our drivers. Not an unreasonable thing to do, you might say, although as I've stated there really wasn't a lot of code in the critical path. The problem is that most of the disk benchmarks turned out to be inappropriate as measures of driver performance.

Measuring weight with a ruler

One common disk benchmark of the time was to read sequential 512-byte blocks from a large file. In Mac OS 8 these reads were passed directly down to the disk driver.

Remember, the disk driver's job was supposed to be to simply pass the incoming requests on to the next layer in the OS. The vast majority of the time would be spent accessing the hardware. So no matter what the requests were, in effect this test should have returned almost exactly the same results for all drivers, with code differences accounting for less than 1% of the results. Right? Wrong.

As we ran the tests, we found out that on this particular benchmark, our drivers were much slower than existing third-party drivers (our competition, more or less). Dramatically slower, in fact — if we ran the test in 100 seconds, they ran it in 10 seconds.

Upon further examination, we found out the reason why they did so well on this benchmark. They had already been optimized for it!

We discovered that the other drivers contained caching logic which would cluster these small 512-byte sequential reads into larger 32KiB or so chunks. Doing so decreased the number of round-trips to the hardware, which boosted their numbers on this benchmark.
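
In rough sketch form (my reconstruction with invented names; I never saw their source, and real drivers would also have to invalidate on writes), the trick looked something like this:

    #include <cstdint>
    #include <cstring>

    void HardwareRead(uint32_t offset, uint32_t length, char *dest);  // assumed

    static char     gCluster[32 * 1024];   // 32KiB driver-level cache
    static uint32_t gClusterStart = 0;
    static bool     gClusterValid = false;

    // Satisfy small sequential reads from one big cached read, so the
    // hardware sees one round-trip per 64 requests instead of 64.
    void DriverRead(uint32_t offset, uint32_t length, char *dest)
    {
        if (!gClusterValid || offset < gClusterStart ||
            offset + length > gClusterStart + sizeof(gCluster)) {
            HardwareRead(offset, sizeof(gCluster), gCluster);
            gClusterStart = offset;
            gClusterValid = true;
        }
        memcpy(dest, gCluster + (offset - gClusterStart), length);
    }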

Do what sells, not what's right

Now, it's important to understand that this tiny-sequential-read benchmark doesn't even reflect real-world use. The fact that larger I/O requests perform better than smaller I/O requests has been well known for decades, and almost every commercial application out there used large I/O requests for this very reason. Buffered I/O is built into the standard C libraries, for pete's sake. In the real world, requests that came into the disk driver tended to be either large or non-sequential.

Modifying our drivers to do the same thing would basically be pointless — a lot of work for very little performance difference.

(It only gets worse when you realize that our disk driver was directly underneath the Mac OS disk cache. A driver-level cache was almost completely redundant. And the right place to do this sort of readahead caching would have been there, in the existing OS cache layer, not in each individual disk driver.)

But ... sigh. The real world doesn't always make sense. These small sequential reads were a well-known benchmark, even if it was a disk benchmark and not a driver benchmark. And that simple fact elevated this atypical scenario into the eyes of a lot of people who didn't really understand what it meant. To them, if our driver didn't do the same thing as the competition, we wouldn't measure up.

De-optimizing to improve the benchmark

So in order to make sales we were forced to burn another month adding and testing driver-level caching so that we'd perform better on this benchmark.

The irony? The work we did to improve this atypical benchmark actually increased the amount of time we spent in the driver code, and sometimes increased the amount of data we read, which in turn decreased real-world performance by as much as 1-2%.

But hey, at least that benchmark got faster.

And we sold our driver.