Performance optimization strategies of last resort [closed]

There are plenty of performance questions on this site already, but it occurs to me that almost all are very problem-specific and fairly narrow. And almost all repeat the advice to avoid premature optimization.

Let’s assume:

  • the code already is working correctly
  • the algorithms chosen are already optimal for the circumstances of the problem
  • the code has been measured, and the offending routines have been isolated
  • all attempts to optimize will also be measured to ensure they do not make matters worse

What I am looking for here is strategies and tricks to squeeze out up to the last few percent in a critical algorithm when there is nothing else left to do but whatever it takes.

Ideally, try to make answers language agnostic, and indicate any down-sides to the suggested strategies where applicable.

I’ll add a reply with my own initial suggestions, and look forward to whatever else the Stack Overflow community can think of.




Answer

OK, you’re defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobbs article in November ‘93, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds, and the source code size was reduced by a factor of 4. My diagnostic tool was this . The sequence of changes was this:

  • The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.
  • Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 sec.
  • Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 sec.

Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can’t seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.

Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don’t “interpret” the sequence of things to do, “compile” it.

  • That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.

Now, because it’s getting so quick, it’s hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.

  • More diagnosis reveals that it is spending time in queue-management. In-lining these reduces the time to 7 seconds.
  • Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds.
  • Now the biggest time-takers are calls to malloc and free . Recycle objects - 2.6 seconds.
  • Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.

Total speedup factor: 43.6

Now no two programs are alike, but in non-toy software I’ve always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on SO.

P.S. It may be wondered why I didn’t use a profiler. The answer is that almost every one of these “problems” was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions. I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there’s no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.

ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time:

 /* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
if (ptop->current_task >= ILST_LENGTH(ptop->tasklist){
. . .
/* FOR EACH OPERATION REQUEST */
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
/* GET CURRENT TASK */
ptask = ILST_NTH(ptop->tasklist, ptop->current_task)

These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with “information hiding” meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a “bottleneck” (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they “should have been avoided” (in hindsight). I hope that gives a bit of the flavor.

Here is the second problem, in two separate lines:

 /* ADD TASK TO TASK LIST */ 
ILST_APPEND(ptop->tasklist, ptask)
. . .
/* ADD TRANSACTION TO TRANSACTION QUEUE */
ILST_APPEND(trnque, ptrn)

These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack) 348 of the original time, so they were not in fact a big problem at the beginning . However, after removing the first problem, they cost 320 of the time and so were now a “bigger fish”. In general, that’s how it goes.

I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.

REFERENCE ADDED: The source code, both original and redesigned, can be found in www.ddj.com , for 1993, in file 9311.zip, files slug.asc and slug.zip .

EDIT 2011/11/26: There is now a sourceforge project containing source code in Visual C++ and a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it doesn’t follow exactly the same sequence, but still gets a 2-3 order of magnitude speedup.