The Intel Pentium line of microprocessors is quite
complex and incorporates many optimizations for squeezing the most
speed out of a poor basic design. Many factors interact to
determine the overall performance of code, but there are several simple
steps you can take to improve the speed of your code without needing
a degree in CPU design. Chips from other manufacturers
such as Cyrix, AMD and IDT (WinChip) have similar features, so these
suggestions will help with them as well.
First, the most important issue when it comes to performance
is the algorithm, not the code. The fastest instructions are
the ones which don't get executed :). After choosing the right algorithm,
further improvement can be made by understanding a little bit about
Pentium internals. The Pentium has two execution
pipes which run in parallel and, under the right conditions, can execute
two instructions simultaneously, so that many simple instructions
effectively take 1/2 a clock cycle each. Keeping the two execution
pipes properly fed is a little tricky, but it can as much as double the speed
of your code. Intel decided that most code consists of moves, simple
math and logic operations, and so they optimized these instructions
for maximum speed. You will need to check the Intel programmer's
data book to get instruction cycle counts, but generally these are
the only instructions which have a chance of executing in 1/2 a
clock cycle. Caching and code alignment also count a great
deal in performance, but I will not get into those issues here.
Even without them, the suggestions below can have a great impact.
The following list is in order of impact on performance,
from most to least important:
1) Avoid loading segment registers - In protected
mode, the Intel chips spend many clock cycles changing selector
values; a single segment register load can eat
18-22 clock cycles. In Win32 programming, you never need to
load the segment registers, since Win32 uses a flat memory model in
which the code, data and stack segments all cover the same address space;
this advice is mostly for DOS programmers.
2) Avoid unnecessary CALL/RET pairs - As you can see
in my 6502 emulator source code, I jump to each instruction's
code and then jump back to the top of the execution loop.
It may not be as elegant as using a call/return, but it is considerably
faster.
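For readers working in C rather than assembly, the same idea can be
sketched as a dispatch loop whose handlers all live inside one function,
so control moves by jumps rather than CALL/RET pairs. This is only an
illustration and not the code from the 6502 emulator: the opcodes and the
run() function are hypothetical, and whether the compiler really turns
the switch into a jump table depends on the compiler and its settings.

```c
#include <stdint.h>

/* Hypothetical opcodes for illustration only; these are not real 6502 opcodes. */
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_DEC = 3 };

/* Runs a tiny byte-coded program and returns the accumulator.
   Every handler is a case inside the loop, so no handler is
   reached through a function call. */
int run(const uint8_t *code)
{
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_LOAD: acc = *code++;  break;  /* load immediate operand */
        case OP_ADD:  acc += *code++; break;  /* add immediate operand */
        case OP_DEC:  acc--;          break;  /* decrement accumulator */
        case OP_HALT: return acc;             /* leave the loop */
        default:      return -1;              /* unknown opcode */
        }
    }
}
```

The alternative, a table of function pointers with one function per
opcode, is tidier but pays for a call and a return on every instruction
that is emulated.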
3) Avoid prefix bytes - The use of prefix bytes for
segment overrides or data/address size overrides is very costly
on the Pentium. The execution pipes stall and there is a penalty
of 1 or more clock cycles. A common example of this is using
WORD sized data in a USE32 segment, e.g. MOV AX,variable.
This may look harmless enough, but in a 32-bit segment it is a performance
killer. Any time a register or memory is accessed with a different
size from the segment type (WORD vs. DWORD), there is a steep penalty.
For C programmers, this means avoiding SHORT variables in 32-bit code.
Most compilers are too stupid to handle this properly and
will generate inefficient output.
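As a small C illustration of the point above: the two functions below
compute the same sum and differ only in the type of their locals. Both
function names are made up for this example, and int is assumed to be
32 bits. In 32-bit code the short version may make the compiler emit
16-bit operations that carry an operand-size prefix, while the int
version matches the native size. What a given compiler actually emits
varies, so check its assembly listing before drawing conclusions.

```c
/* Locals are 16-bit: in a 32-bit segment the compiler may emit
   prefixed 16-bit instructions for the counter and the total. */
int sum_with_short(const unsigned char *data, short count)
{
    short total = 0;
    short i;
    for (i = 0; i < count; i++)
        total = (short)(total + data[i]);
    return total;
}

/* Locals are the native size (assuming a 32-bit int), so no
   size override is needed on a 32-bit target. */
int sum_with_int(const unsigned char *data, int count)
{
    int total = 0;
    int i;
    for (i = 0; i < count; i++)
        total += data[i];
    return total;
}
```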
4) Avoid AGIs (Address Generation Interlocks)
- This is Intel's term for the stall that occurs when an instruction uses
a register for an effective address before that register has had the
2 clocks of setup time it needs. An example is the following:
    mov esi,addr1
    mov [esi],eax    ; AGI: ESI was loaded by the previous instruction
    add esi,4        ; fine: ESI is modified after it has been used
The problem here is that the
Pentium EA calculator needs 2 clocks to set up an address before
it is used. Using ESI as an address before it is ready will cause both
instruction pipes to stall. The fix is to place other useful work
between loading the register and using it as an address. The register
used for the EA (in this case ESI) can be modified immediately after
it is used, but not immediately before.
5) Pair your instructions properly - Instruction
pairing means that you code each pair of instructions so that they
can execute in parallel in the two pipes. Put simply, an instruction
should not use or modify a register that the instruction just before
it writes. Intel considers the byte registers to be parts
of a whole, so modifying AL and then AH is equivalent to modifying
EAX in both instructions. Here are examples of good and bad
code:
    Good                    Bad
    mov eax,[esi]           mov eax,[esi]
    mov ebx,[esi+4]         inc eax
    inc eax                 mov ebx,[esi+4]
    inc ebx                 inc ebx
By properly pairing instructions
you ensure that both instruction pipes are continuously running
smoothly.
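In C the compiler does the actual instruction scheduling, but the source
can still be arranged so that neighbouring operations do not depend on
each other, which gives the compiler independent work to place side by
side. The sketch below is a hypothetical example (the name
sum_two_chains is made up for this article): it sums an array with two
independent running totals instead of one. Whether this helps depends on
the compiler and the CPU, so measure before relying on it.

```c
#include <stddef.h>

/* Sums n ints using two independent totals so that consecutive
   additions do not both wait on the same variable. */
long sum_two_chains(const int *v, size_t n)
{
    long even = 0, odd = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        even += v[i];       /* does not depend on the next line */
        odd  += v[i + 1];
    }
    if (i < n)              /* odd-length array: one element left over */
        even += v[i];
    return even + odd;
}
```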
That's basically it. Obviously there are
many details left unsaid here, but using these techniques can
get you a measurable improvement in performance without making
a mess of your code. The 486 line of CPUs incorporates some
of these same ideas (such as AGIs) and will also benefit from coding
this way. You should not need to make special versions of
code for each type of Pentium or equivalent; also, you will not
suffer any performance penalties by following these rules.
I would recommend "Michael Abrash's Graphics Programming
Black Book" as well as the Intel VTune product for help
in understanding the fine details of Pentium optimization rules.
HanaHo Games, Inc. Copyright © 2002