Getting the most speed from your emulator (or how I got Galaga to run on a 486-66)

What follows is a guide for emulator authors who want to get the best performance from their emulators.  I will follow the steps I took to squeeze the most speed out of Galaga.  Galaga is a good example for this tutorial since it has a reasonably high level of complexity: it contains 3 Z80 microprocessors running in parallel at 3.125 MHz, and its display hardware has 3 layers - Stars, Sprites, and Characters.

After you have optimized the basic design of your emulator, the next step is to find the maximum performance gain available from each component.  Ideally every piece of code would be as optimized as possible, but the most effort should be spent on the code which takes the most time.  I disabled each module one at a time (such as sprite drawing) to see what the maximum attainable speed would be if that module took no time at all; this way I was able to concentrate my efforts on the areas which needed it most.  Besides following my basic code optimization rules, the following are the 8 main components of an emulator and how to specifically optimize each one:

CPU Emulation
In the case of Galaga, this takes up a large portion of the emulation time.  It is possible to use a multithreaded approach to simulate micros running in parallel, but since my cpu emulator code depends on static variables for speed, I chose a round-robin approach.  The downside of this approach is the time wasted entering and exiting each micro (a context switch).  Galaga requires a high level of synchronization between the micros for the sound to work, so the emulator spends a lot of cycles on context switches.  A technique which improved performance dramatically was busy-loop removal.  A busy loop is code which sits in a loop waiting for an event such as an interrupt.  Almost every cpu in almost every video game I have debugged uses a busy loop to wait for the next frame or event to begin (signalled by an interrupt).  Emulating these loops spends valuable time doing nothing.  I have a check in the inner loop of my cpu emulators (not published code) which looks for the busy-loop address and immediately exits the time slice when it is found.  This alone increased the speed of Galaga by about 25-30%.  I assume that I don't need to state the obvious - USE ASSEMBLY LANGUAGE WHEN POSSIBLE; the difference between cpu emulation in C and in assembly can be 2-5X.
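
The busy-loop check can be sketched in C roughly as follows.  This is only an illustration under stated assumptions, not the unpublished code mentioned above: the Cpu struct, step_instruction() and run_slice() are made-up names, the instruction executor is a stand-in with a fixed 4-cycle cost, and the sketch assumes the loop waits only on an interrupt.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, stripped-down CPU context (a real Z80 core holds far more). */
typedef struct {
    uint16_t pc;             /* program counter */
    uint16_t busy_loop_pc;   /* idle-loop address, found with a debugger */
    int      irq_pending;    /* nonzero when an interrupt is waiting */
    long     cycles_skipped; /* cycles handed back by leaving the loop early */
} Cpu;

/* Stand-in for decoding and executing one instruction.
   It only advances pc and reports a fixed 4-cycle cost. */
int step_instruction(Cpu *cpu)
{
    cpu->pc++;
    return 4;
}

/* Run one round-robin time slice; returns the cycles actually emulated. */
long run_slice(Cpu *cpu, long budget)
{
    long emulated = 0;

    assert(budget >= 0);
    while (emulated < budget) {
        /* At the idle loop with no interrupt pending: assume nothing useful
           happens until the next interrupt and give up the rest of the slice. */
        if (cpu->pc == cpu->busy_loop_pc && !cpu->irq_pending) {
            cpu->cycles_skipped += budget - emulated;
            break;
        }
        emulated += step_instruction(cpu);
    }
    return emulated;
}
```

The scheduler would call run_slice() for each micro in turn; a micro that is idling hands its remaining cycles straight back instead of spinning through them.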

User Input
This is more of an issue under Windows than DOS.  I found some wasted time in my keyboard message handler, which was calling GetAsyncKeyState() instead of just watching for keypress messages.  Every call to a Windows function costs a lot of time just getting there and back, so avoid or reduce Windows function calls.  I also call Sleep() in my timing loop when there is enough time to spare, which reduces CPU utilization.

Emulated I/O
This applies to all function calls which take place within the cpu emulation.  In my emulator design, I have a set of flags which mark addresses for normal or 'special' use (special use being a function pointer to a handler routine).  Galaga has a shared memory area between the three processors that required a handler routine.  CPUs #2 and #3 share memory that lives in CPU #1's memory map, and I found that the SharedRead() routine was also being called, unnecessarily, in CPU #1's own context.  My point is that there can be lots of fat to trim in the handler routines as well.
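
A flag-plus-handler scheme of this kind might look like the sketch below.  The layout is an assumption made for illustration (one flag byte per address, one handler pointer per 256-byte page); mem_read(), map_special(), shared_read() and the counter are hypothetical names, not the actual routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_SIZE 0x10000

typedef uint8_t (*ReadHandler)(uint16_t addr);

uint8_t     memory[MEM_SIZE];             /* flat RAM/ROM image */
uint8_t     special[MEM_SIZE];            /* 1 = address needs a handler */
ReadHandler read_handler[MEM_SIZE >> 8];  /* one handler per 256-byte page */

uint8_t shared_ram[0x100];  /* hypothetical shared area */
int     shared_reads;       /* counts slow-path calls, for illustration */

/* Example handler for the shared area. */
uint8_t shared_read(uint16_t addr)
{
    shared_reads++;
    return shared_ram[addr & 0xFF];
}

/* Mark the range start..end (inclusive) as special and attach a handler. */
void map_special(uint16_t start, uint16_t end, ReadHandler fn)
{
    uint32_t a;

    assert(start <= end && fn != NULL);
    for (a = start; a <= end; a++) {
        special[a] = 1;
        read_handler[a >> 8] = fn;
    }
}

/* Fast path: one flag test, then a direct array read.
   Only flagged addresses pay for the indirect call. */
uint8_t mem_read(uint16_t addr)
{
    if (!special[addr])
        return memory[addr];
    return read_handler[addr >> 8](addr);
}
```

With one flag table per CPU, a CPU that owns the shared RAM can leave the range unflagged in its own table and read it directly, so only the other CPUs go through the handler.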

Sprite Drawing
This is one area of the code which I have completely rewritten at least 5 times.  Besides creating the most efficient methods of drawing/erasing sprites with transparency, there's the not-so-obvious point of only drawing sprites which need to be drawn.  Taking MAME as an example of what not to do, there are many drivers (especially for NAMCO games) which test 'flags' which don't exist and end up drawing every possible sprite when only a few are actually enabled.  Look at the sprite memory map during various phases of gameplay and you will see what a disabled sprite looks like.  Sometimes the color is set to 0, other times the X coordinate is placed outside the visible area, and still other times there is an actual flag bit indicating the sprite is off.  A little debugging will sort this out and can gain you valuable speed.  Another issue is to use as few memory planes as possible.  For example, some MAME drivers will draw a character plane, then a sprite plane, and later combine them.  As my optimization rule #5 states, "The less memory you touch, the faster you go".  Even if the code becomes a bit more complicated, try to keep all of the drawing in one memory area (a temp bitmap).
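
The "only draw what is enabled" test could be sketched as below.  Everything here is assumed for illustration: the Sprite layout, the 224-pixel visible width and the SPRITE_ENABLE bit are hypothetical, and a real game normally uses just one of the three conventions, which has to be found by watching its sprite RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VISIBLE_WIDTH 224   /* assumed visible width in pixels */
#define SPRITE_ENABLE 0x01  /* hypothetical enable bit */

/* Hypothetical sprite RAM entry. */
typedef struct {
    uint8_t code;   /* tile number */
    uint8_t color;  /* color set; 0 is treated as "off" here */
    uint8_t x, y;   /* screen position */
    uint8_t flags;  /* bit 0: enable */
} Sprite;

/* Returns 1 only if the sprite would actually appear on screen. */
int sprite_is_enabled(const Sprite *s)
{
    if (s->color == 0)               return 0;  /* color 0: disabled */
    if (s->x >= VISIBLE_WIDTH)       return 0;  /* parked off-screen */
    if (!(s->flags & SPRITE_ENABLE)) return 0;  /* explicit off bit */
    return 1;
}

/* Writes the indices of enabled sprites to out[] and returns how many
   there are, so the (expensive) draw routine runs only for those. */
int collect_enabled(const Sprite *table, int count, int *out)
{
    int i, n = 0;

    assert(table != NULL && out != NULL && count >= 0);
    for (i = 0; i < count; i++)
        if (sprite_is_enabled(&table[i]))
            out[n++] = i;
    return n;
}
```

The per-frame draw loop then iterates over the short index list rather than over every slot in sprite RAM.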

Character Drawing
As I've found in most character/sprite games, there are only a few characters changing each frame.  I found that the most efficient way to handle this is to create a set of flags to indicate which characters change each frame and only draw them.   There are some games with scrolling regions or multiple layers which may appear to need every character drawn every frame, but this is almost never necessary.
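
One way to keep such flags is to set them in the video RAM write handler, as in this sketch.  The 1024-entry video RAM, the screen[] array standing in for the real character renderer, and all function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define VRAM_SIZE 1024  /* assumed: 32 x 32 character cells */

uint8_t video_ram[VRAM_SIZE];  /* character codes written by the game */
uint8_t dirty[VRAM_SIZE];      /* 1 = cell changed since the last redraw */
uint8_t screen[VRAM_SIZE];     /* stand-in for the rendered characters */

/* Called from the emulated CPU's write handler for video RAM. */
void video_ram_write(uint16_t offset, uint8_t value)
{
    assert(offset < VRAM_SIZE);
    if (video_ram[offset] != value) {  /* rewriting the same value is free */
        video_ram[offset] = value;
        dirty[offset] = 1;
    }
}

/* Stand-in for the real renderer, which would blit an 8x8 tile. */
void draw_char(int offset)
{
    screen[offset] = video_ram[offset];
}

/* Once per frame: redraw only the cells that changed.
   Returns the number of characters drawn. */
int redraw_dirty_chars(void)
{
    int i, drawn = 0;

    for (i = 0; i < VRAM_SIZE; i++) {
        if (dirty[i]) {
            draw_char(i);
            dirty[i] = 0;
            drawn++;
        }
    }
    return drawn;
}
```

Comparing against the old value in the write handler matters because many games rewrite large parts of video RAM every frame with unchanged data.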

Color Optimization
This is an important issue that is sometimes overlooked.  Many games have a palette ROM and a color ROM which look like they will require more than 256 combinations of colors, and thus a table lookup for every sprite/char drawn.  Often a careful analysis of gameplay will show that the game only uses a subset of those colors and that they will in fact fit in a 256 color palette.  This can increase speed measurably, but not dramatically.  I would leave this step for last since it promises the least benefit for the effort involved.

Sound Emulation
This is basically just common sense.  I found that in the case of Galaga I only need to update the sound 60 times per second for it to sound good.  If you think about it this makes sense, since most sound effects and music would not have any notes shorter than a 60th of a second.

Video Access
This is probably the slowest part of your emulation code (at least in Windows).  Video memory is considerably slower than main memory, which means you should limit how much of the screen is updated each frame.  I use a simple dirty rectangle technique in HiVE which divides the screen into 32 horizontal bars.  By only copying the parts of the display which change, considerable time can be saved.  Some games, such as those with star fields or scrolling regions, make this difficult.  Galaga, for example, needs nearly the entire screen painted each frame because of the stars.  Other games such as Pac-Man can be highly optimized with this technique since only a small portion of the display changes each frame.
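
A horizontal-bar scheme of this sort might be sketched as follows.  This is not HiVE's code: the 224x288 screen size, the 8-bit pixels, the video[] array standing in for real video memory, and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_W  224                     /* assumed screen size */
#define SCREEN_H  288
#define NUM_BANDS 32
#define BAND_H    (SCREEN_H / NUM_BANDS)  /* 9 rows per band */

uint8_t offscreen[SCREEN_W * SCREEN_H];  /* drawn into every frame */
uint8_t video[SCREEN_W * SCREEN_H];      /* stand-in for slow video memory */
uint8_t band_dirty[NUM_BANDS];           /* 1 = band changed this frame */

/* All drawing goes through here (or marks its band the same way). */
void plot(int x, int y, uint8_t color)
{
    assert(x >= 0 && x < SCREEN_W && y >= 0 && y < SCREEN_H);
    offscreen[y * SCREEN_W + x] = color;
    band_dirty[y / BAND_H] = 1;
}

/* Copy only the changed bands to video memory.
   Returns the number of bands copied. */
int flush_dirty_bands(void)
{
    int band, copied = 0;

    for (band = 0; band < NUM_BANDS; band++) {
        size_t start = (size_t)band * BAND_H * SCREEN_W;

        if (!band_dirty[band])
            continue;
        memcpy(video + start, offscreen + start, (size_t)BAND_H * SCREEN_W);
        band_dirty[band] = 0;
        copied++;
    }
    return copied;
}
```

Full-width bars keep the bookkeeping to one byte per band and make each copy a single contiguous block, at the cost of copying a whole bar when only a few pixels in it change.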

Some may be inclined to try drawing directly onto the video buffer instead of an offscreen buffer to save time.  Very few games would work well with this technique because of flicker.  An example of a game where this is possible is Space Invaders.  Since it has a bitmapped display, it only changes a small portion in any one frame and so there would be no noticeable flicker.  A game containing sprites would usually not work well because the sprites would need to be erased and redrawn each frame leading to flicker.
