Why Intel Can't Save Me
Right now I'm working on performance-tuning the next release of one of our products. I've been through this two or three times before (I've either lost count or repressed memories, so I'm really not sure), and it's pretty much always a beating: lots of waiting for tests to run, limited visibility into what's actually going on, arbitrary targets that may or may not be meaningful, a huge amount of work to define data sets and test scripts, and worst of all no real visibility into how far away from your targets you are.
As with any beating, I'd really like to avoid it if it were possible. Every so often, I'll be talking with someone about what I'm doing, and they'll respond with some quip like, "Why are you guys doing that? Just tell our customers to buy faster boxes." If only it were that simple. Intel and the other processor manufacturers have certainly made life easier for us developers, which means that we can use higher-level, more productive but potentially slower tools like Java (and, hopefully, Ruby, at least some day). Unfortunately, that doesn't really get me off the hook: application performance, at least in a web application, is dictated by the weakest critical link. If that weak link isn't CPU-based then 2x or 4x more horsepower just isn't going to help, and even if it is it's rarely the case that all you're looking for is a 2x to 4x speedup. It's not uncommon to find that a certain bit of code, be it a computation or a database query or something else, really needs to be a hundred times faster or more.
Unfortunately for us developers and for our hapless victims/users, there are a seemingly infinite number of ways to screw up performance. Rather than really being about writing "fast" code, performance tuning an application like ours is all about finding and fixing all the "slow" code that makes up the weak links that are holding things back. How many ways are there to screw up performance in a server app like ours, you ask? Let me count the ways . . .
Algorithmic Problems
One thing that I've always found interesting about computer science is the sheer magnitude of the variance between an optimal algorithm and other functional but sub-optimal algorithms. Ask two engineers to build engines and you probably won't end up with one that's 1,000,000 times more powerful than the other under normal conditions. Mechanical engineering just doesn't work that way. Ask two engineers to write sorting algorithms, though, and then sort a large list (say a million names) with each of them, and there's a pretty good chance one might be a million times faster than the other.
Why is that? As every undergrad CS major knows, you measure algorithmic complexity as a relationship between running time (or some other resource) and the input size (generally represented as n). The differences in algorithms are sometimes just constant factors, like 10n versus 27n, but they're often in the order of growth: 10n log n versus n^2, or 10n^2 versus n^4. In pathological cases, you get relationships like 2^n, which essentially means you might as well not bother unless n is a single digit.
Faster hardware can help with the constant factors, but there's really nothing a non-quantum computer is going to do about the order differences. The problem is compounded by the facts that it's non-trivial to determine the algorithmic running time, that it's easy to completely mis-estimate what sort of n to expect (and to test with far-too-low values of n), and that there are inevitably thousands of "algorithms" in a piece of software, only a handful of which actually warrant real attention. The only solution is to test a lot, find the ones that matter, and make them fast. That new quad-core CPU won't do anything to speed up a single algorithm unless it's parallelizable, and even if Intel magically gives us a 4x raw speed gain, you'll still find that 1/4th of Way Too Long is usually still Way Too Long.
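To make the order-of-growth gap concrete, here's a minimal sketch that sorts the same array twice: once with a hand-rolled O(n^2) insertion sort and once with the JDK's O(n log n) Arrays.sort. The class and method names are illustrative, not from any real codebase.

```java
import java.util.Arrays;
import java.util.Random;

public class SortGap {
    // Classic O(n^2) insertion sort: fine for tiny n, hopeless for large n.
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }

    public static void main(String[] args) {
        int n = 30_000; // large enough that the gap is obvious
        int[] a = new Random(42).ints(n).toArray();
        int[] b = a.clone();

        long t0 = System.nanoTime();
        insertionSort(a);
        long quadraticNanos = System.nanoTime() - t0;

        t0 = System.nanoTime();
        Arrays.sort(b); // O(n log n)
        long nLogNNanos = System.nanoTime() - t0;

        // The quadratic sort is typically hundreds of times slower here,
        // and the ratio only grows with n -- no constant-factor hardware
        // speedup closes that gap.
        System.out.printf("insertion sort: %d ms, Arrays.sort: %d ms%n",
                quadraticNanos / 1_000_000, nLogNNanos / 1_000_000);
    }
}
```

Bump n up and the ratio gets worse; that's exactly why a 4x faster CPU can't rescue a quadratic algorithm.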
Synchronization Problems
It's great that Intel is coming out with those quad-core CPUs, but how do you take advantage of that if your code only runs on one of them? To truly use all that horsepower, you need to have a multi-threaded application, which means you need to synchronize access to shared resources. If you're lucky, you can avoid the problem altogether by not having any shared resources, but that's often not a realistic option. Unfortunately for programmers (and their hapless end-users), writing correct, bug-free, performant synchronized code is really, really hard. There are a ton of interesting ways to write buggy synchronized code, but there are two pretty foolproof ways to destroy performance with synchronization. First of all, you can end up locking so much that your application essentially becomes single-threaded, as every thread ends up waiting for access to synchronized resources. Now your quad-core is magically a single processor again. Secondly, you can run into a far nastier form of that problem not by holding locks for too long but by synchronizing so much and so often that you incur a massive amount of system overhead from the acquisition and release of locks and from the attendant context switches in the OS. The first thing I've had to do on this release, in fact, is to rewrite a lot of synchronized code to be faster and to reduce the amount and duration of locking, as that's been the main bottleneck so far.
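Both failure modes come down to how much work happens under (or around) a lock. As a hedged sketch, here are two thread-safe counters: one that takes a single shared lock on every increment, and one that uses AtomicLong's lock-free compare-and-swap. The names are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

public class Counters {
    private long locked = 0;
    private final AtomicLong lockFree = new AtomicLong();

    // Every caller contends for the same monitor; under heavy load this
    // effectively serializes all your cores onto one lock.
    public synchronized void incrementLocked() {
        locked++;
    }

    // Lock-free: a hardware compare-and-swap, no monitor acquisition,
    // no context switch when threads collide.
    public void incrementLockFree() {
        lockFree.incrementAndGet();
    }

    public synchronized long lockedValue() {
        return locked;
    }

    public long lockFreeValue() {
        return lockFree.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Counters c = new Counters();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    c.incrementLocked();
                    c.incrementLockFree();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // Both counts are correct; the difference is contention overhead.
        System.out.println(c.lockedValue() + " " + c.lockFreeValue());
    }
}
```

Both versions are correct; the point is that correctness and scalability are separate problems, and the second one is what performance tuning is about.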
Memory Consumption, Object Allocation and Garbage Collection
Garbage-collected languages are great, but the garbage collection itself isn't free. GC algorithms take time and, as anyone who's tried to use a desktop Java application knows, they can grind an application to a halt. Using too much memory can cause garbage collection to happen more often since there's less free space to play with, and depending on the hardware you're on can lead to excessive swapping. Allocating a lot of temporary objects ends up as a double performance hit, because you both incur the GC penalty more often and you generally incur some synchronization or system overhead in order to allocate the memory in the first place (at least if you're using a language that doesn't allow for stack-allocated objects). Future versions of the Java VM should really help out here, since they ought to be able to do what's called escape analysis and determine when an object can be allocated on the stack for efficiency, but for now one of my main jobs as Performance Guy is to reduce the number of object allocations that happen on each request.
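The classic Java example of accidental temporary-object churn is string concatenation in a loop. This sketch (names illustrative) builds the same string two ways: with += on a String, which creates a fresh String object and copies everything so far on every iteration, and with a single StringBuilder.

```java
import java.util.Arrays;

public class AllocDemo {
    // Each += allocates a new String and copies all previous characters:
    // O(n^2) copying and n short-lived objects for the collector to chew on.
    static String concatNaive(String[] parts) {
        String out = "";
        for (String p : parts) {
            out += p;
        }
        return out;
    }

    // One builder whose backing array grows in amortized O(n) time:
    // same result, a small constant number of allocations.
    static String concatBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = new String[10_000];
        Arrays.fill(parts, "x");
        // Identical output; wildly different allocation behavior.
        System.out.println(concatNaive(parts).equals(concatBuilder(parts)));
    }
}
```

Hunting down per-request allocation patterns like the first one is exactly the kind of work the paragraph above describes.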
Bad Database Queries
There are few more reliable ways to grind an entire web application to a halt than a bad database query. It really only takes one to completely lock up the database and make the application unusable for everyone else. Slow queries could be caused by a lack of proper indexing, poor database statistics that cause the db to choose a bad query strategy, db optimizer bugs, or just an overly-complicated query that can't be made to perform without nasty denormalizations (if even then). Faster database servers can help, but bad query plans result in performance that's so much worse than good query plans that the hardware really becomes irrelevant at that point. And unless you have your entire database in memory, a bad query that leads to a full table-scan will hammer your disks. To make things even more fun for a developer, queries will have very different performance characteristics on different-sized data sets and with different data distributions, which means you need to test a lot to make sure you're covering everything.
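For readers who haven't watched a full table scan in the wild, here's a loose in-memory analogy (the "table" and "rows" are made up, and a real database involves disks and a query planner): finding one row by scanning every entry, versus looking it up through a hash index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexAnalogy {
    // A made-up "row" with a primary key and one column.
    record Row(int id, String name) {}

    // The "full table scan": touch every row until the key matches, O(n).
    static Row scanFor(List<Row> table, int id) {
        for (Row r : table) {
            if (r.id() == id) {
                return r;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Row> table = new ArrayList<>();
        Map<Integer, Row> index = new HashMap<>(); // the "index": O(1) expected
        for (int i = 0; i < 200_000; i++) {
            Row r = new Row(i, "name-" + i);
            table.add(r);
            index.put(r.id(), r);
        }
        // Same answer either way; the scan does 200,000 comparisons,
        // the index does roughly one.
        System.out.println(scanFor(table, 199_999).equals(index.get(199_999)));
    }
}
```

A real database index (typically a B-tree) works on the same principle: pay a little at write time so lookups don't have to touch every row.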
Excessive IO
If your application spends all its time talking over the network or reading from the hard drive, a faster machine isn't really going to help much. A faster hard drive or more RAM might, but hard drive improvements are incremental at best and advance nowhere near as quickly as processor speed improvements. The only real solution to excessive IO is to find it and eliminate it.
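One cheap way IO overhead sneaks in is issuing far more system calls than necessary. As a small sketch (the temp file is created just for the demo), here's the same file read byte-by-byte straight from the stream versus through a BufferedInputStream that batches reads into large chunks.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferedReadDemo {
    // Reads a stream one byte at a time and counts the bytes.
    static long countBytes(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("io-demo", ".bin");
        Files.write(tmp, new byte[1_000_000]);

        // Unbuffered: each read() can reach all the way down to the OS.
        long t0 = System.nanoTime();
        try (InputStream in = Files.newInputStream(tmp)) {
            countBytes(in);
        }
        long unbuffered = System.nanoTime() - t0;

        // Buffered: the wrapper fetches large chunks, so most read() calls
        // are served from memory.
        t0 = System.nanoTime();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(tmp))) {
            countBytes(in);
        }
        long buffered = System.nanoTime() - t0;

        Files.delete(tmp);
        System.out.printf("unbuffered: %d ms, buffered: %d ms%n",
                unbuffered / 1_000_000, buffered / 1_000_000);
    }
}
```

The buffered version usually wins by an order of magnitude or more, but the deeper fix is the one in the paragraph above: find the IO you don't need and stop doing it.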
That's just a brief taxonomy of the sorts of problems you run into during performance tuning. All of them are pretty easy to do, and any one of them by itself can bring your application to its knees and render your hardware useless. Even if hardware were free, which it most certainly is not, you'd still have to do performance tuning in order to get rid of the pathologically bad algorithms, queries, memory usage, etc. Of course the first order of business with performance tuning is to write the tests and create your sample data set (which is a pain), and the second order of business is to actually figure out what's going on (which is often the hardest part of the job), so only after a lot of back-breaking, mind-numbing labor do you actually get to the point where you're fixing things.
So what's the easiest way to avoid all that work? Simple: just don't write server applications that people will actually use. Performance tuning is, unfortunately, one of the prices of success.