In my professional life I've found myself measured by an external benchmark many times. One time that comes to mind was when I was writing disk drivers for Mac OS 8.
The disk driver's job in Mac OS 8 was simply to pass requests from the filesystem layer to the underlying storage medium. Requests would come in for such-and-such amount of data at such-and-such an offset from the start of the volume. After some very minimal translation, we'd pass this request onward to the ATA Manager, or USB Manager, or FireWire libraries.
(Sounds easy, right? Well, yes, but the devil was in the details. USB and FireWire supported "hot-plugging", i.e. attaching or removing the drive while the computer was still running. Supporting this in an OS that was never designed for it was a big chore. CD and DVD drives were also much more difficult to handle than regular hard drives, for similar reasons.)
But we ran into a problem as we were preparing our drivers for release. The hardware manufacturers that were buying our drivers wanted to make sure our drivers were "fast". So they ran disk benchmarks against our drivers. Not an unreasonable thing to do, you might say, although as I've stated there really wasn't a lot of code in the critical path. The problem was that most of the disk benchmarks turned out to be inappropriate as measures of driver performance.
Measuring weight with a ruler
One common disk benchmark of the time was to read sequential 512-byte blocks from a large file. In Mac OS 8 these reads were passed directly down to the disk driver.
Remember, the disk driver's job was supposed to be to simply pass the incoming requests on to the next layer in the OS. The vast majority of the time would be spent accessing the hardware. So no matter what the requests were, in effect this test should have returned almost exactly the same results for all drivers, with code differences accounting for less than 1% of the results. Right? Wrong.
As we ran the tests, we found out that on this particular benchmark, our drivers were much slower than existing third-party drivers (our competition, more or less). Dramatically slower, in fact -- if we ran the test in 100 seconds, they ran it in 10 seconds.
Upon further examination, we found out the reason why they did so well on this benchmark. They had already been optimized for it!
We discovered that the other drivers contained caching logic which would cluster these small 512-byte sequential reads into larger chunks of 32 KiB or so. Doing so reduced the number of round-trips to the hardware, which increased their performance on this benchmark.
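To make the effect concrete, here is a minimal Python sketch of that kind of clustering. It is not the competitors' code or ours; the class names, the single-cluster cache, and the in-memory "device" are all assumptions made for illustration. Only the 512-byte request size and the roughly 32 KiB cluster size come from the story above.

```python
SECTOR = 512          # size of each benchmark read (from the text)
CLUSTER = 32 * 1024   # "32 KiB or so" read-ahead chunk (from the text)


class CountingDevice:
    """Hypothetical stand-in for the hardware: counts round-trips."""

    def __init__(self, size):
        self.data = bytes(size)
        self.round_trips = 0

    def read(self, offset, length):
        self.round_trips += 1
        return self.data[offset:offset + length]


class PassThroughDriver:
    """Forwards every request to the device unchanged."""

    def __init__(self, device):
        self.device = device

    def read(self, offset, length):
        return self.device.read(offset, length)


class ReadAheadDriver:
    """Serves small reads from one cached cluster (illustrative only)."""

    def __init__(self, device, cluster=CLUSTER):
        self.device = device
        self.cluster = cluster
        self.start = None   # device offset of the cached cluster
        self.buf = b""

    def read(self, offset, length):
        # Large requests gain nothing from the cache, so pass them through.
        if length >= self.cluster:
            return self.device.read(offset, length)
        hit = (self.start is not None
               and self.start <= offset
               and offset + length <= self.start + len(self.buf))
        if not hit:
            # Miss: fetch a whole cluster starting at the requested offset.
            self.start = offset
            self.buf = self.device.read(offset, self.cluster)
        rel = offset - self.start
        return self.buf[rel:rel + length]


def sequential_benchmark(driver, total):
    """Read `total` bytes in SECTOR-sized steps; return device round-trips."""
    for offset in range(0, total, SECTOR):
        driver.read(offset, SECTOR)
    return driver.device.round_trips
```

Run over 1 MiB, the pass-through driver makes 2048 trips to the device and the read-ahead driver makes 32. Note what the sketch does on a miss: it fetches a full 32 KiB even when the caller asked for 512 bytes, which is exactly the wrong trade for small non-sequential reads.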
Do what sells, not what's right
Now, it's important to understand that this tiny-sequential-read benchmark didn't even reflect real-world use. The fact that larger I/O requests perform better than smaller ones has been well known for decades, and almost every commercial application out there used large I/O requests for this very reason. Buffered I/O is built into the standard C libraries, for Pete's sake. In the real world, requests that came into the disk driver tended to be either large or non-sequential.
Modifying our drivers to do the same thing would basically be pointless -- a lot of work for very little performance difference.
(It only gets worse when you realize that our disk driver was directly underneath the Mac OS disk cache. A driver-level cache was almost completely redundant. And the right place to do this sort of readahead caching would have been there, in the existing OS cache layer, not in each individual disk driver.)
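The point about buffered I/O can be seen in a small sketch. It uses Python's `io.BufferedReader` as a stand-in for the C library's buffered streams; `CountingRaw` and `small_reads` are hypothetical names invented here, and the "file" is just a counter of how many requests would reach the layers below the application.

```python
import io

SECTOR = 512
BUFFER = 32 * 1024   # application-side buffer size, chosen for illustration


class CountingRaw(io.RawIOBase):
    """Hypothetical unbuffered file: counts requests passed further down."""

    def __init__(self, size):
        super().__init__()
        self.size = size
        self.pos = 0
        self.requests = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Each call here represents one request handed to the OS and driver.
        self.requests += 1
        n = min(len(b), self.size - self.pos)
        b[:n] = bytes(n)
        self.pos += n
        return n


def small_reads(stream, total):
    """Read `total` bytes from `stream` in SECTOR-sized pieces."""
    got = 0
    while got < total:
        chunk = stream.read(SECTOR)
        if not chunk:
            break
        got += len(chunk)
    return got
```

Reading 1 MiB straight from a `CountingRaw` issues one request per 512-byte read. Wrapping it as `io.BufferedReader(raw, buffer_size=BUFFER)` lets the application keep making 512-byte reads while far fewer, much larger requests go down the stack. Once requests already arrive that way, a second cache in the driver has very little left to coalesce.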
But ... sigh. The real world doesn't always make sense. The small-sequential-read test was a well-known benchmark, even if it was a disk benchmark and not a driver benchmark. And that simple fact elevated this atypical scenario in the eyes of a lot of people who didn't really understand what it meant. To them, if our driver didn't do the same thing as the competition, we wouldn't measure up.
De-optimizing to improve the benchmark
So in order to make sales we were forced to burn another month adding and testing driver-level caching so that we'd perform better on this benchmark.
The irony? The work we did to improve this atypical benchmark actually increased the amount of time we spent in the driver code, and sometimes increased the amount of data we read, which in turn decreased real-world performance by as much as 1-2%.
But hey, at least that benchmark got faster.
And we sold our driver.