Pentium optimization made easy

The Intel Pentium line of microprocessors is quite complex and incorporates many optimizations for squeezing the most speed out of a poor basic design.  Many factors interact to determine the overall performance of code, but there are several simple steps you can take to speed up your code without needing a degree in CPU design.  Other chip makers such as Cyrix, AMD and IDT (WinChip) offer similar features, so these suggestions will help with their chips as well.

First, the most important issue when it comes to performance is the algorithm, not the code.  The fastest instructions are the ones which don't get executed :).  After choosing the right algorithm, you can gain further improvement by understanding a little about Pentium internals.

The Pentium line of processors has two execution units (the U and V pipes) which run in parallel and, under the right conditions, can execute two instructions simultaneously, so that many instructions effectively cost 1/2 a clock cycle each.  Keeping the two execution pipes properly fed is a little tricky, but it can as much as double the speed of your code.  Intel decided that most code consists of moves, simple math and logic operations, and so optimized these instructions for maximum speed.  You will need to check the Intel programmer's databook for instruction cycle counts, but generally these are the only instructions which have a chance of executing in 1/2 a clock cycle.

Caching and code alignment also count a great deal in performance, but I will not get into those issues here; the suggestions below will still have a great impact.  The following list is ordered from most to least important in its impact on performance:

1) Avoid loading segment registers - In protected mode, Intel chips spend many clock cycles changing selector values; a single segment register load can eat 18-22 clock cycles.  In Win32 programming, you never need to load the segment registers since they all point to the same selector; this advice is mostly for DOS programmers.

2) Avoid unnecessary CALL/RET pairs - As you can see in my 6502 emulator source code, I have jumps to each instruction's code and then a jump back to the top of the execution loop.  It may not be as elegant as using a call/return, but it is considerably faster.
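
The jump-based dispatch described above might look roughly like the sketch below.  This is not the actual emulator source; the labels exec_top, op_table, op_nop and op_inx and the register assignments are made up for illustration, and the fragment has not been scheduled for pairing or AGIs.

```asm
exec_top:
   xor  eax,eax
   mov  al,[esi]                   ; fetch the next opcode byte
   inc  esi
   jmp  dword ptr op_table[eax*4]  ; jump (not call) to the handler

op_nop:
   jmp  exec_top                   ; jump (not ret) back to the loop

op_inx:
   inc  bl                         ; hypothetical: BL holds the 6502 X register
   jmp  exec_top                   ; (flag updates omitted)
```

Each handler ends with a plain JMP to the top of the loop, so no return address is ever pushed or popped.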

3) Avoid prefix bytes - The use of prefix bytes for segment overrides or data/address size overrides is very costly on the Pentium.  The execution pipes stall and there is a penalty of 1 or more clock cycles.  A common example is using WORD sized data in a USE32 segment, e.g. MOV AX,variable.  This may look harmless enough, but in a 32-bit segment it is a performance killer.  Any time a register or memory is accessed with a different size from the segment type (WORD vs. DWORD), there is a steep penalty.  For C programmers, this is what happens when you use SHORT variables in 32-bit code.  Most compilers are too stupid to handle this properly and will generate inefficient output.
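
As a rough illustration of the operand-size case, the fragment below contrasts a WORD access with a DWORD access in a USE32 segment.  The variable names count16 and count32 are hypothetical, and the second form assumes you are free to declare the data as a DWORD.

```asm
   ; WORD access in a 32-bit segment: each instruction carries an
   ; operand-size prefix byte (66h)
   mov  ax,count16
   inc  ax
   mov  count16,ax

   ; DWORD access: same steps, no prefix bytes
   mov  eax,count32
   inc  eax
   mov  count32,eax
```

The two versions wrap around at different values, so widening a variable is only safe when the code does not rely on 16-bit overflow.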

4)  Avoid AGIs (Address Generation Interlocks) - This is Intel's term for the stall that occurs when an instruction uses a register in an effective address (EA) and that register needs 2 clocks of setup time before it can be used.  An example is the following:


   mov  esi,addr1
   ; there needs to be 2 clocks of other instructions here to avoid an AGI
   mov  [esi],eax
   add  esi,4          ; no problem doing this after the EA is used


The problem here is that the Pentium's EA calculator needs 2 clocks to set up an address before it is used.  Using ESI before it is ready causes both instruction pipes to stall.  The register used for the EA (ESI in this case) can be modified immediately after it is used, but not immediately before.
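
One way to hide the delay is to move work that does not depend on ESI into the gap.  The sketch below is only an illustration: the labels count and table and the choice of filler instructions are made up, and how many instructions it takes to fill the gap depends on how they pair, so check the result against the Intel cycle counts or a profiler.

```asm
   mov  esi,addr1      ; ESI will be used as an address below
   mov  ecx,count      ; unrelated work: does not touch ESI
   mov  ebx,table      ; unrelated work
   xor  edx,edx        ; unrelated work
   dec  ecx            ; unrelated work
   mov  [esi],eax      ; ESI was loaded well before this use
   add  esi,4          ; modifying ESI after the use is fine
```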

5) Pair your instructions properly - Instruction pairing means arranging your code so that each pair of adjacent instructions can execute in parallel in the two pipes.  The main rule is that the second instruction of a pair must not read or write a register that the first one writes.  Intel considers the byte registers to be parts of a whole, so modifying AL and then AH is equivalent to modifying EAX in both instructions.  Here are examples of good and bad code:

good:  mov  eax,[esi]       bad:    mov  eax,[esi]
       mov  ebx,[esi+4]             inc  eax
       inc  eax                     mov  ebx,[esi+4]
       inc  ebx                     inc  ebx


By properly pairing instructions you ensure that both instruction pipes are continuously running smoothly.
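
The byte-register rule is easy to trip over.  In the sketch below (the register choices are arbitrary, picked only for illustration), the first version writes AL and then AH, which counts as writing EAX twice, so the two loads are not expected to pair; the second version uses byte registers that belong to different 32-bit registers.

```asm
   ; not expected to pair: AL and AH both belong to EAX
   mov  al,[esi]
   mov  ah,[esi+1]

   ; can pair: AL and BL belong to different registers
   mov  al,[esi]
   mov  bl,[esi+1]
```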

That's basically it.  Obviously there are many details left unsaid here, but these techniques can get you a measurable improvement in performance without making a mess of your code.  The 486 line of CPUs incorporates some of the same ideas (such as AGIs) and will also benefit from coding this way.  You should not need to make special versions of your code for each type of Pentium or equivalent, and you will not suffer any performance penalty by following these rules.  I recommend Michael Abrash's "Graphics Programming Black Book" as well as the Intel VTune product for help in understanding the fine details of the Pentium optimization rules.
