Tuesday, September 04, 2007

Microbenchmarking  

I want to talk a little about a technique that can be very useful. Like all programming tools, however, it's a double-edged sword and can easily be abused.

I'm talking about microbenchmarks. A microbenchmark is a form of benchmark. Let's quickly define our terms:

Benchmark

noun: a relative measure of the performance of an operation (or set of operations) in computer software. Many factors may be used to create this measure, such as memory use or I/O, but when unspecified the measure is usually speed.

verb: to create or run a benchmark.

Microbenchmarks

Sometimes the term "benchmark" is too broad. I want to specifically talk about single benchmarks that are limited in scope. One commonly used term for this is "microbenchmark".

Microbenchmark

noun: a benchmark that measures the performance of a single operation which is a very small portion of a larger system.

verb: to create or run a microbenchmark.

I would consider each of the following to be an example of a microbenchmark:

  • Implementation comparisons. Suppose you are choosing between multiple implementations of the C standard library routine strcmp(). You might use a microbenchmark to compare different implementations against each other until you find the one that's right for you.

  • Equivalent function comparison. Sometimes there's more than one way to achieve your goal. If you want to know which is faster, you might use a microbenchmark to compare the different functions.

  • Inter-language primitive comparisons. Microbenchmarks are often used to compare programming languages. For example, Java, C++, and C# are often microbenchmarked to see which has the fastest string comparison, stack allocation, heap allocation, and so on.

  • Individual aspects of a large system's performance. Examples might include adding a row to a database table, sorting a large list of input files, or even application launch time. These aspects may be measured in isolation (i.e., without a strict performance target) with the presumption that speeding them up is generally useful, since the end user's experience is composed of lots of these small operations.

I think the last example above is the most generic description of a microbenchmark, because it encompasses the others. When you are benchmarking strcmp(), you are really measuring an individual aspect of the C library's performance. And when you are comparing language primitives, you are again measuring an individual aspect of the language's performance.

The prefix "micro-" is relative, of course. If your job description is "Chief Programmer: strcmp()", then measuring the performance of strcmp() is simply benchmarking, not microbenchmarking. But if you are working on a large modern GUI application for a consumer OS, measuring the performance of strcmp() becomes a microbenchmark because it's such a small part of your overall system.

Losing Track of the Macro

Before I start talking about how to do microbenchmarks, there's an important point to be made. Improperly applied microbenchmarks can actually be hazardous to real-world performance.

Benchmarking goes hand-in-hand with optimization when it's done correctly. You benchmark first, and then optimize. Similarly, microbenchmarking goes hand-in-hand with microoptimization. And microoptimization is rarely a good idea.

That's not to say that microbenchmarks are bad. They can be really useful when applied correctly. I remember the competitions at Apple to create the fastest BlockMove for PowerPC, which were often repeated with each new generation of processor. Since it was one of the most heavily used system calls in the Mac OS, each incremental improvement made the OS faster. Many individual benchmarks, such as drawing and text manipulation, got faster whenever BlockMove was improved.

The danger comes not from the microbenchmarking itself. The real problem occurs when you focus so much on the micro that you forget about the big picture. There are three main negative outcomes possible:

Dangers of Microbenchmarking and Microoptimization

  • Wasting development time. Newbie programmers sometimes burn weeks optimizing something that doesn't really need optimizing. This doesn't matter much when you're doing it as a hobby, but it matters a lot when you're in a real job.

  • Making something else slower. Sometimes performance can be like a carpet. Push it down over here, and it pops up over there. If the operation that you're microbenchmarking does not crop up often in real-world scenarios, you may find that all your micro-optimizing actually decreases your real-world performance.

  • Introducing bugs. Sloppy optimization can be worse than no optimization. Your code might work today, but the trick you used might not be future-proof.

Performance is Chunky

There are two important lessons that I've learned about software performance over the years:

Lesson 1: Performance is "chunky". Not all parts of the code contribute equally to the software's speed. A small piece of code might turn out to be a very big chunk of the performance, and a big piece of code might be almost immaterial. You need to know what your chunks look like before you can optimize intelligently.

Lesson 2: Design trumps everything. Choosing the right design or algorithm for your code is far more important than almost any microoptimization you can make. If your code calls strcmp() 4 million times in a loop, you're better off changing your design to call it less often. There's no reason to waste time looking for a faster strcmp().
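
To make Lesson 2 concrete, here's a small hypothetical sketch (the category table and record loop are invented for illustration, not taken from any real project): instead of shopping for a faster strcmp(), resolve the string to an integer ID once, up front, and let the hot loop compare plain integers.

/* design_sketch.cpp -- hypothetical illustration of Lesson 2.
   The data and numbers are invented; the point is the change in design. */
#include <stdio.h>
#include <string.h>

static const char *kCategories[] = { "audio", "image", "text", "video" };
static const int   kCategoryCount = sizeof(kCategories) / sizeof(kCategories[0]);

/* The "before" design: a strcmp() scan for every record processed. */
static int lookupCategory(const char *name)
{
    for (int i = 0; i < kCategoryCount; ++i)
        if (strcmp(name, kCategories[i]) == 0)
            return i;
    return -1;
}

int main()
{
    /* The "after" design: one string lookup... */
    const int wantedID = lookupCategory("image");

    /* ...and four million integer comparisons instead of strcmp() calls. */
    int matches = 0;
    for (int record = 0; record < 4000000; ++record) {
        int recordCategory = record % kCategoryCount;   /* stand-in for real data */
        if (recordCategory == wantedID)
            ++matches;
    }
    printf("%d matching records\n", matches);
    return 0;
}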

From these two lessons we can pretty easily derive the types of optimizations that work best. Here they are, from most effective to least effective:

Ordered Rules for Optimization

Want to make your code faster? Follow these rules in order.

  1. Design for efficiency. Understand the real task. Eliminate redundancy. Understand how your algorithm is performing, and the ways it might not be optimal. Redesign the whole thing until it is optimal.

  2. Optimize areas where your code spends a lot of time. Sometimes there is an inner loop which is repeated millions of times. The first answer is always rule #1: "don't do that" — try to redesign your code to reduce the number of times you perform that operation. But if it truly can't be avoided, then go ahead and expend some effort on speeding it up.

  3. Microoptimization. Pick a single operation and make it faster in isolation. This is almost always the wrong thing to do.

Why are microbenchmarking and microoptimization almost always wrong? Because you should be taking care of #1 or #2 first. And the fundamental essence of programming is that the first two are never complete.

When should you microbenchmark?

You should microbenchmark if you are writing part of an operating system, a virtual machine, or the runtime for a programming language. Compiler writers may also microbenchmark the output of their compiler.

Very occasionally, it might be useful to do it as part of the first two optimization tasks. Perhaps you're choosing between different algorithms, one which involves calling functionA() a thousand times, and one which involves calling functionB() a thousand times. Rather than writing two complete implementations and comparing them, it might be instructive to do a quick microbenchmark to see which of the two is faster in isolation.
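
Here's a minimal sketch of that kind of quick comparison, assuming Mac OS X and the same mach_absolute_time() timer used in the code sample later in this post. functionA() and functionB() are just placeholders for the two routines being compared; they'd live in a separate module so the compiler can't inline them or delete the calls (see step 6 below).

/* ab_compare.cpp -- hypothetical sketch; functionA() and functionB() are
   placeholders, defined in a separate module so the compiler can't inline
   them or remove the calls. */
#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

int functionA(int arg);   /* defined elsewhere */
int functionB(int arg);   /* defined elsewhere */

static double secondsFor(int (*func)(int), int iterations)
{
    struct mach_timebase_info tb;
    mach_timebase_info(&tb);

    int result = 0;
    uint64_t start = mach_absolute_time();
    for (int i = 0; i < iterations; ++i)
        result = func(result);
    uint64_t stop = mach_absolute_time();

    fprintf(stderr, "(result = %d)\n", result);    /* use the result so it isn't dead */

    /* timebase ticks * numer / denom = nanoseconds */
    return ((double)(stop - start) * tb.numer / tb.denom) / 1e9;
}

int main()
{
    const int iterations = 1000;   /* "a thousand times"; scale up per step 2 */
    printf("functionA: %0.9f seconds\n", secondsFor(functionA, iterations));
    printf("functionB: %0.9f seconds\n", secondsFor(functionB, iterations));
    return 0;
}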

You can also do it for your own amusement and education, or for a blog entry. While not terribly useful, it can be informative and teach you a thing or two about your OS.

How to write a microbenchmark

With all that said, let's get down to the details. Suppose that you eventually find yourself in a position where you really do need to run a microbenchmark.

I use a ten-step process:

  1. Consider all the factors that may impact performance. There may be non-obvious considerations like memory alignment, cache characteristics, or data set size. Decide exactly how you want to test your operation with those factors in mind.

  2. Write a loop to perform the operation a large number of times. Any single measurement might be way off, because your computer is busy doing all sorts of things in the background, like processing interrupts, that are out of your control. So you need to average over a longish interval. Try starting with about a hundred thousand iterations, then scale up by factors of ten until you wind up with a total runtime of a second or two.

  3. Use a high-resolution timer to record the start and stop time. Your data is only as good as your timer. Find the best one available. On Mac OS X that's mach_absolute_time(), and on Windows it's QueryPerformanceCounter(). Other platforms usually have an equivalent high-resolution timer. Call it only twice: once before the big loop, and once after.

  4. Write code to compute the result. The approximate time it takes for a single operation will be elapsedTime / numberOfIterations. The elapsedTime is generally (stopTime - startTime) / timerFrequency. Watch out for integer overflow and floating-point precision loss: it's easiest to just use double-precision math.

  5. Consider whether you should compile your code at the highest optimization level available. Compilers generate significantly slower code when optimization is off. You're almost certainly interested in the performance of the fully optimized build rather than the unoptimized debug build, but optimization can introduce its own difficulties. It depends on what exactly you're benchmarking.

  6. Compile, then disassemble your test to make sure it's really doing what you think it's doing. This step is non-negotiable!! If you can't read assembly, you shouldn't be doing microbenchmarking. Compilers can be tricky, and there's no guarantee the generated code matches your intent until you look. If you are testing strcmp(), for example, you might find that the compiler has optimized out all ten million calls in your inner loop because you're using a string constant instead of a variable. Or perhaps your ten million iterations of input = input + 1 have been collapsed into a single input = input + 10000000. If this is happening to you, you'll need to find a way to force the compiler to do what you want. With gcc, for example, you might need to link functions in via separate modules to prevent inlining, or specify additional flags on the command line.

  7. Quiesce your system. You don't have to shut everything down, but do make sure that you're not running under abnormally heavy load. Most systems these days have a CPU meter which can tell you what's going on. If you're writing code on an embedded system, make sure that any asynchronous startup operations (like initializing the network stack) have finished before you begin.

  8. Run the test and get the results. If it runs too quickly or too slowly, you might need to adjust your iteration count and try again. If you get what looks like a good result, there are two more steps:

  9. Run the test several more times and make sure your result is reproducible. This will also give you an idea of the typical error in your measurement. (A rough sketch of automating these repeat runs appears after this list.)

  10. If you support different platforms or processors, run the test on each one. Sometimes constructs that work well on one processor are the worst case for another. Occasionally operating systems will change the implementation of their standard library functions enough to perturb performance. Don't assume that because something is faster on one configuration, it will be faster everywhere.
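
Here's a rough sketch of automating step 9, again assuming Mac OS X and the same mach_absolute_time() timer as the code sample below; testfunc() is assumed to live in a separate module, for the reasons discussed after that sample.

/* repeat_runs.cpp -- hypothetical sketch of step 9: repeat the whole
   measurement several times and look at the spread between runs. */
#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

int testfunc(int arg1, int arg2);    /* defined in a separate module */

int main()
{
    struct mach_timebase_info tb;
    mach_timebase_info(&tb);

    const int iterations = 300000000;
    const int runs = 5;
    double best = 0.0, worst = 0.0;

    for (int run = 0; run < runs; ++run) {
        int result = 0;
        uint64_t start = mach_absolute_time();
        for (int i = 0; i < iterations; ++i)
            result = testfunc(result, i);
        uint64_t stop = mach_absolute_time();

        double seconds = ((double)(stop - start) * tb.numer / tb.denom) / 1e9;
        printf("run %d: %0.9f seconds (result %d)\n", run + 1, seconds, result);

        if (run == 0 || seconds < best)  best  = seconds;
        if (run == 0 || seconds > worst) worst = seconds;
    }

    /* A large gap between best and worst usually means the machine wasn't
       quiet enough (step 7) or the run was too short (step 2). */
    printf("best %0.9f seconds, worst %0.9f seconds\n", best, worst);
    return 0;
}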

Code Sample

Here's a piece of sample code I wrote to microbenchmark the speed of a function call on Mac OS X. I've removed the function body in order to simply show the overall technique.

/* foo.cpp */
#include <stdio.h>
#include <mach/mach_time.h>

int testfunc(int arg1, int arg2);

int main(int, const char **)
{
    struct mach_timebase_info  tb;
    uint64_t   start, stop, elapsed;
    double     seconds;
    const int  iterations = 300000000;
    int result = 0;

    mach_timebase_info(&tb);

    // Run the test.
    start = mach_absolute_time();
    for (int i=0; i<iterations; ++i)
        result = testfunc(result,i);
    stop = mach_absolute_time();
    elapsed = stop - start;

    // Print results.
    seconds = ((double)elapsed * tb.numer / tb.denom) / (1000*1000*1000);
    printf("%d iterations = %0.9lf seconds\n", iterations, seconds);
    printf("1 iteration = %0.9lf seconds\n", seconds/iterations);
    return 0;
}
/* bar.cpp */
int testfunc(int, int)
{
    return 0;
}

Notice how the function being called is in a completely different source file? That's because on my first attempt, gcc's inlining optimizer completely eliminated the function call. This was only visible upon inspecting the disassembly (step 6). Putting the target function in another module was the easiest way to stop it from being inlined. (Yes, I tried -fno-inline and __attribute__((noinline)). Neither had the desired effect.)
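
If splitting the code across modules is inconvenient, another trick that often works is to funnel every result through a volatile variable, so the compiler has to assume the value matters. The sketch below is hypothetical and not a substitute for the separate-module approach when the function call itself is what's being measured, since volatile does nothing to prevent inlining; it's for keeping the body of an operation from being thrown away. As always, verify with the disassembly (step 6).

/* volatile_sink.cpp -- hypothetical sketch: a volatile sink keeps the
   benchmarked work from being eliminated as dead code.  Note that this does
   not prevent inlining, so it can't replace the separate-module trick when
   call overhead itself is the thing being measured. */
#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

static volatile unsigned int sink;   /* the compiler must keep every store to this */

int main()
{
    struct mach_timebase_info tb;
    mach_timebase_info(&tb);

    const int iterations = 300000000;
    unsigned int value = 12345;

    uint64_t start = mach_absolute_time();
    for (int i = 0; i < iterations; ++i) {
        value = value * 1103515245u + 12345u;   /* the operation under test */
        sink = value;                           /* forces the work to survive */
    }
    uint64_t stop = mach_absolute_time();

    double seconds = ((double)(stop - start) * tb.numer / tb.denom) / 1e9;
    printf("%d iterations = %0.9f seconds\n", iterations, seconds);
    printf("1 iteration = %0.9f seconds\n", seconds / iterations);
    return 0;
}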

I've chosen to ignore the overhead of the loop itself in this measurement. It's essentially negligible for most types of microbenchmarks. Furthermore, as long as you wind up comparing the result against another test run with identical loop overhead, it will simply cancel from both sides of the equation.
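
If you ever do want to account for it, one rough approach (hypothetical, not part of the measurement above) is to time a second loop with the same shape and the smallest body you can get away with, then subtract it as a baseline. The baseline deserves the same step 6 scrutiny, because an empty loop is exactly what a compiler most likes to delete.

/* baseline.cpp -- hypothetical sketch: subtract the cost of the bare loop.
   The baseline loop stores to a volatile so it can't be removed entirely;
   inspect both loops in the disassembly to confirm what's being measured. */
#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

int testfunc(int arg1, int arg2);    /* in a separate module, as above */

static volatile int keepAlive;

int main()
{
    struct mach_timebase_info tb;
    mach_timebase_info(&tb);
    const int iterations = 300000000;

    /* The loop under test. */
    int result = 0;
    uint64_t start = mach_absolute_time();
    for (int i = 0; i < iterations; ++i)
        result = testfunc(result, i);
    uint64_t stop = mach_absolute_time();
    double testSeconds = ((double)(stop - start) * tb.numer / tb.denom) / 1e9;

    /* The baseline loop: same shape, minimal body. */
    start = mach_absolute_time();
    for (int i = 0; i < iterations; ++i)
        keepAlive = i;
    stop = mach_absolute_time();
    double baseSeconds = ((double)(stop - start) * tb.numer / tb.denom) / 1e9;

    printf("test %0.9f s, baseline %0.9f s, difference %0.9f s (result %d)\n",
           testSeconds, baseSeconds, testSeconds - baseSeconds, result);
    return 0;
}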

So that's it. Anyone have any interesting situations where you've found that you needed to run a microbenchmark? Does your technique differ from mine? Found any unexpected results? Let me know in the comments.