Writing high performance code

As far as creating fast code, there is much to be said for knowing your compiler/machine well and producing highly optimized code, but with a little knowlege that I will present here, you can turn ordinary/slow code into fast code.  What follows is a set of general rules and examples which will guide you around the most common performance pitfalls.

1) Choose the right algorithm - Before you spend time optimizing or profiling that critical section of code, make sure that there isn't a better way to do it.   Changing algorithms can have much more of a dramatic effect on your code's speed than even rewriting it in assembly language.  Don't bother tweaking assembler instructions either until you are sure you have the right algorithm; you may end up throwing away that code and having to start over.

2) The 80/20 rule - In some cases this may be the 90/10 rule.   Essentially this means that your program will spend the majority of its time (80%) in a very small portion of the code (20%).  I call this 'coding by statistics' and this is how I was able to get so much speed out of my CPU emulators - by finding which code was the most frequently used and optimizing the hell out of it.  Certainly writing all of your code to be as efficient as possible is a good idea, but you needn't bother wasting time on a section which will only get executed once or infrequently.   That is why only parts of HiVE are written in assembler: CPU emulators and the graphics code.  I could write the entire program in assembly language, but the result would be very little change in speed for a large investment in time.

3) Keep system / stdlib calls to a minimum - Programmers who do not know/care about what is going on inside the machine sometimes make the mistake of assuming that any C function/keyword takes about the same amount of time to execute.  The fact is that all function calls require a pushing of parameters then a call/return which takes considerably more time than inline code.  System calls are even worse because they sometimes must pass through CPU call gates and security schemes which slow things down even more.  Shown below are two examples of what NOT to do:

a) for (i=0; i < arraysize; i++)
       {
       write(handle, mydata[i], sizeof(mydata[i]));
       }


The problem with the above code is that each call to the write() function wastes a lot of time (probably milliseconds) and should be consolidated into one large write or buffer the data in RAM first, then write it in large blocks (> 64K).  If you see any application which causes the disk activity light to blink rapidly, then it is poorly written and wasting lots of time making system calls.

b) memset(mypointer, value, 1);
   memcpy(mypointer1, mypointer2, sizeof(int));


To many programmers, the problem with the code above may seem quite obvious, but I have seen code exactly like this written by a programmer at IBM.  For moving small blocks or setting a few bytes, don't use the stdlib functions because the overhead of the pushing of parameters and call/return take MUCH longer than just assigning a value with a pointer.

4) Use optimized assembly language where it makes sense - Even the best compilers in the world will not be able to beat the optimizer between your ears.   Besides optimization, assembly language provides access to instructions that would never be generated by a compiler.  For example, the 680x0 processors have bitfield instructions which are excellent for graphics and data compression, but would never be used by a C compiler because the C language does not have efficient means to work with bits and bytes.  Yet, using these instructions in the right way can make dramatic speed improvements in your graphics or data compression routines.

5) The less memory you touch, the faster you go - Ever since the days of the 'XT-Turbo' machines, the CPU in a PC has been working with data a whole lot faster than the RAM/ROM can supply it.  Consider a 200Mhz Pentium machine with 60ns DRAM.   Instructions can execute in 1 clock cycle which is 5ns, while memory access time is at best 60ns.  The external cache RAM may be 10ns, but it will not be in use all of the time.  Using less memory, even if it makes the code more complex, can have dramatic effects on the performance.  Sometimes there is no choice, but other times through a little compression or clever means you can reduce the memory which gets modified in your 'inner loop'.

Back