Fast SSD1306 OLED drawing with I2C bit banging
The SSD1306 OLED displays are very popular with hobbyists due to their low cost and easy interfacing. Most of the ones sold expose a two-wire interface (TWI), aka I2C. The default speed for I2C is 100 kHz and the "fast" mode is 400 kHz. These are the two standard speeds supported by most AVR Arduinos. An I2C clock rate of around 800 kHz is also possible on AVR MCUs, but it is not supported directly by the Wire library. The I2C standard recently added some higher speeds (1 MHz and 3.4 MHz). The 3.4 MHz version uses a slightly different protocol. At 400 kHz, using the I2C hardware and the Wire library, my code was able to refresh the display at around 23.5 frames per second (FPS).
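To put that 23.5 FPS figure in context, here is a rough back-of-the-envelope calculation of the ceiling the bus clock imposes. It assumes a 128x64 panel (1024 bytes per frame), which is the common size but is an assumption here, and it ignores start/stop conditions, address bytes and control bytes. The name `max_fps` is made up for this illustration.

```c
#include <stdint.h>

/* Assumption: a 128x64 monochrome panel, i.e. 128 * 64 / 8 = 1024 bytes per frame. */
#define FRAME_BYTES     1024u

/* Each byte costs 9 SCL clocks: 8 data bits plus the acknowledge slot. */
#define CLOCKS_PER_BYTE 9u

/* Upper bound on frames per second for a given SCL frequency in Hz.
   Protocol overhead (start/stop, address and control bytes) is ignored. */
static double max_fps(uint32_t scl_hz)
{
    return (double)scl_hz / (double)(FRAME_BYTES * CLOCKS_PER_BYTE);
}
```

Under those assumptions, `max_fps(400000)` comes out a little above 43, so the measured 23.5 FPS is a bit more than half of what a 400 kHz bus could carry in principle. The rest presumably goes to per-transaction and library overhead.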
I have already written an SSD1306 library for both Linux and Arduino, but I wanted to drive the display from an ATtiny85 and learn about the I2C protocol in the process. The ATtiny85 doesn't have I2C hardware built in, so it needs to be emulated in software using GPIO pins. There are several public domain libraries available (e.g. TinyWireM), but I wanted to see how much code is necessary to talk to a write-only I2C device and how well I could optimize it. The SSD1306 OLED controller also supports a 10 MHz SPI interface, so I assumed that the I2C interface on these displays could probably be driven faster than "spec" without any major issues. The code I created is not necessarily practical, nor the 'right' way to do it, but I wanted to see how fast I could get it in C/C++ without having to write it in AVR assembly language.
Caution: For this experiment, I'm running the AVR at 16 MHz with a Vcc of 4.5 V. I connected the GPIO lines and Vcc directly to the SSD1306. I've seen info indicating that these displays are meant to run at only 3.3 V, and other info showing that they're safe from 3.3 V to 5 V. Proceed at your own risk. If you run this code on an AVR configured for 8 MHz and 3.3 V, you'll see about half the performance I measured.
I grabbed a copy of the I2C protocol specification (Rev 6, April 4, 2014) which is apparently owned by NXP Semiconductors. The condensed version is that there is typically one master and one or more slave devices on the bus (data + clock lines, aka SDA + SCL). The signal lines are normally pulled up to VCC and in tri-state (high impedance). When the master wants to begin a transaction, it sets the lines as output signals and follows the protocol. After each byte is sent, the slave sends an acknowledge bit back to the master to signal that the byte was received successfully. I was curious whether this could be ignored, and the SSD1306 doesn't seem to care. This meant that I could leave the SDA and SCL lines as outputs the whole time I was writing data. Before anyone starts to complain that I'm not following the spec: for this project I'm not interested in creating a 100% compliant I2C protocol emulator. I just want to see how fast I can push the SSD1306 by bit-banging the data into the I2C pins.
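The sequence described above can be sketched as follows. This is a minimal model that runs on a PC rather than on an AVR: `set_sda`, `set_scl` and the tiny sampling "slave" are hypothetical helpers invented for the illustration, not part of any library, and no timing delays are included.

```c
#include <stdint.h>

/* Hypothetical bus model: two wires plus a tiny "slave" that samples
   SDA on every rising edge of SCL. */
static uint8_t  sda = 1, scl = 1;  /* idle bus: both lines pulled high */
static uint16_t rx_bits;           /* bits seen by the model slave */
static uint8_t  rx_clocks;         /* rising SCL edges seen */

static void set_sda(uint8_t level) { sda = level; }

static void set_scl(uint8_t level)
{
    if (level && !scl) {           /* rising edge: slave samples SDA */
        rx_bits = (uint16_t)((rx_bits << 1) | sda);
        rx_clocks++;
    }
    scl = level;
}

/* START: SDA falls while SCL is high, then SCL goes low. */
static void i2c_start(void) { set_sda(1); set_scl(1); set_sda(0); set_scl(0); }

/* STOP: SDA rises while SCL is high. */
static void i2c_stop(void)  { set_sda(0); set_scl(1); set_sda(1); }

/* One byte, MSB first, followed by the ninth clock for the ACK slot. */
static void i2c_write_byte(uint8_t b)
{
    for (uint8_t i = 0; i < 8; i++) {
        set_sda((b & 0x80) ? 1 : 0);  /* data changes only while SCL is low */
        set_scl(1);
        set_scl(0);
        b <<= 1;
    }
    set_sda(1);   /* release SDA so the slave could pull it low (ACK) */
    set_scl(1);   /* ninth clock; a compliant master would read SDA here */
    set_scl(0);
}
```

A transaction is then `i2c_start()`, one or more `i2c_write_byte()` calls, and `i2c_stop()`. Each byte costs nine clock pulses, which is why the acknowledge slot is an obvious target when hunting for speed.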
For my first pass, I followed the I2C spec precisely and used the pinMode() and digitalWrite() functions for a functional baseline. As a coder, you don't want too many unknowns to have to debug, so I usually start with the simplest code to get it working. Surprisingly, the code worked the first time and resulted in a display refresh speed of 5.5 FPS. The clock frequency I'm generating varies from byte to byte and bit to bit, but I2C is very forgiving as long as the data is stable during the clock transitions. The speed is not impressive, but that's not a deterrent because I know that those GPIO access methods are slow. A little background: the AVR MCUs come in a variety of configurations and the GPIO ports are mapped to the pins differently depending on the chip. The pinMode and digitalWrite / digitalRead functions hide those differences by referencing everything by an Arduino pin number. This makes it easier to port your software from an ATmega328 to an ATtiny85. The downside to using those functions is that they do a bit more than just translate the pin numbers, and this causes poor performance. The alternative way to access GPIO on AVRs is to reference the PORT (digital output) and DDR (data direction) registers directly. This makes the code less readable to people unfamiliar with AVR MCUs, but it is necessary in order to gain the speed.
Since my I2C protocol code was working with the slower access method, the next step was to convert it to talk directly to the GPIO ports of the AVR. The AVR has dedicated instructions (SBI and CBI) to speed up access to I/O ports, so setting or clearing a single bit of a GPIO port (driving a pin high or low) takes one instruction, which executes in just two clock cycles on these chips. After replacing all of the I/O methods, the code was able to refresh the display at 86.5 FPS. This is an impressive speed, but not unique. I've seen existing code on GitHub which looks similar and probably performs about the same.
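As a rough illustration of direct port access, here is a minimal sketch. The pin assignment (SDA on PB0, SCL on PB2) and the helper names are assumptions made for the example. The `#else` branch supplies stand-in variables so the same code also builds on a PC.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>                    /* real PORTB and DDRB registers */
#else
static volatile uint8_t PORTB, DDRB;   /* stand-ins so the sketch builds on a PC */
#endif

/* Assumed wiring, for illustration only: SDA on PB0, SCL on PB2. */
#define SDA_BIT 0
#define SCL_BIT 2

static void bus_init(void)
{
    DDRB  |= (uint8_t)((1 << SDA_BIT) | (1 << SCL_BIT));  /* both pins as outputs */
    PORTB |= (uint8_t)((1 << SDA_BIT) | (1 << SCL_BIT));  /* idle state: both high */
}

/* With a constant bit number, avr-gcc can emit a single SBI or CBI
   instruction for each of these helpers. */
static inline void sda_high(void) { PORTB |= (uint8_t)(1 << SDA_BIT); }
static inline void sda_low(void)  { PORTB &= (uint8_t)~(1 << SDA_BIT); }
static inline void scl_high(void) { PORTB |= (uint8_t)(1 << SCL_BIT); }
static inline void scl_low(void)  { PORTB &= (uint8_t)~(1 << SCL_BIT); }
```

Compared with digitalWrite(), there is no pin-number lookup at run time: the port and bit are fixed at compile time, which is what lets the compiler reduce each helper to one instruction.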
This is the part where I get creative and go beyond the "usual" ideas to get the maximum speed. My first instinct is to check that the compiler is doing a good job with my code. The default compiler flags for the Arduino IDE include "-Os". This enables most of the -O2 (optimization level 2) optimizations but favors smaller code. To see what the compiler is generating, I found it easiest to use the avr-objdump tool. This is one of several AVR binary tools. Instructions for installing them can be found here. One of the things I noticed was that the code being generated for an if/then/else statement wasn't as efficient as just setting the else condition by default, then checking for the "if". Here's the before and after:
This code is in the inner loop shifting the bits out to the SDA line. The before version:
Here's the one which generates faster code:
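The two forms might look roughly like this. It is a sketch of the idea rather than the original listing: the function names, the `SDA_BIT` assignment and the PC stand-in for `PORTB` are all assumptions.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#else
static volatile uint8_t PORTB;   /* stand-in so the sketch builds on a PC */
#endif

#define SDA_BIT 0                /* assumed pin, for illustration only */

/* Form 1: if/else, each branch writes the port. */
static void set_sda_if_else(uint8_t b)
{
    if (b & 0x80)
        PORTB |= (uint8_t)(1 << SDA_BIT);
    else
        PORTB &= (uint8_t)~(1 << SDA_BIT);
}

/* Form 2: apply the "else" value unconditionally, then override it
   when the top bit of b is set. */
static void set_sda_default_first(uint8_t b)
{
    PORTB &= (uint8_t)~(1 << SDA_BIT);
    if (b & 0x80)
        PORTB |= (uint8_t)(1 << SDA_BIT);
}
```

Both forms leave SDA in the same final state. The second briefly drives SDA low even when the bit being sent is a 1, which is harmless as long as SCL is low at that moment.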
More advanced compilers would generate the same output for either set of statements, but on the AVR, we have to manually nudge the C compiler to generate the fastest output.
I also noticed that my "inline" modifiers on static functions were being ignored. This may be due to the -Os option trying to make the code small. I eventually brute forced my inline code, but let's explore some other avenues first. The I2C standard says that after transmitting each byte, the master puts the SDA line into tri-state and sends an additional clock cycle to receive an acknowledge (ACK) bit from the slave device. This ACK bit (if zero) indicates that the byte was received successfully and another can be sent. The SSD1306 is designed such that you can keep sending it data bytes forever because the address wraps around when it reaches the end of the internal buffer. This should mean that we can skip checking the ACK bit. I tested this theory by setting SDA low, but not changing it to tri-state. This worked reliably, but it is another area where I'm 'breaking' the spec. I also tried leaving SDA at whatever value it had at the end of each byte (equally likely to be high or low), and that resulted in occasional failures. Removing the code that changes the pin direction (DDR register) saves a few more cycles per byte.
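The difference between the two ways of handling the ninth clock can be sketched like this. The function names, pin masks and PC stand-ins for the registers are assumptions for the illustration. The shortcut version deliberately breaks the spec in the way described above.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#else
static volatile uint8_t PORTB, DDRB;   /* stand-ins so the sketch builds on a PC */
#endif

#define SDA_MASK 0x01   /* assumed: SDA on PB0 */
#define SCL_MASK 0x04   /* assumed: SCL on PB2 */

/* Spec-style ACK slot: release SDA, clock once, take SDA back. */
static void ack_slot_compliant(void)
{
    DDRB  &= (uint8_t)~SDA_MASK;   /* SDA becomes an input (released) */
    PORTB |= SCL_MASK;             /* ninth clock high */
    /* a compliant master would sample the SDA pin here */
    PORTB &= (uint8_t)~SCL_MASK;   /* ninth clock low */
    DDRB  |= SDA_MASK;             /* SDA becomes an output again */
}

/* Shortcut: keep SDA as an output, drive it low, clock once. */
static void ack_slot_shortcut(void)
{
    PORTB &= (uint8_t)~SDA_MASK;   /* SDA low; direction register untouched */
    PORTB |= SCL_MASK;             /* ninth clock high */
    PORTB &= (uint8_t)~SCL_MASK;   /* ninth clock low */
}
```

The shortcut drops the two DDR writes per byte. Because nothing is read back, it is only suitable for a write-only device, as in this experiment.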
At this point, the refresh rate is about 90 FPS. The code I designed for this project assumes that the SDA and SCL pins are controlled by the same AVR port. This is a reasonable assumption for the ATtiny85 since it only has one (PORTB). By making this assumption, I can save a bit more time by combining some of the SDA + SCL operations into a single logical OR/AND and have a byte register variable hold temporary results to avoid repeated operations. For example, I can pre-read the current state of the PORT into a byte variable. This preserves the state of the other pins controlled by it. If I clear the bits I'm using for SDA and SCL, then I no longer have to re-read the PORT every time I want to set or clear a bit (the difference between read-modify-write and modify-write). This difference, writing a value from an MCU register to a port versus using a read-modify-write operation (e.g. PORT |= value), has a significant effect on performance. Here's an example of how I use that to my advantage in my byte shifting function:
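A byte-shifting function built on that idea might look roughly like the following. This is a sketch rather than the original listing: only the `bOld` name comes from the text, while `send_byte`, the `PORT_OUT`/`PORT_IN` macros, the pin masks and the PC stand-in (a fake port with a model slave that samples SDA on rising SCL edges) are assumptions.

```c
#include <stdint.h>

#define SDA_MASK 0x01   /* assumed: SDA on PB0 */
#define SCL_MASK 0x04   /* assumed: SCL on PB2 */

#ifdef __AVR__
#include <avr/io.h>
#define PORT_OUT(v) (PORTB = (v))   /* plain register write, no read-back */
#define PORT_IN()   (PORTB)
#else
/* PC stand-in: a fake port plus a model slave sampling SDA on rising SCL. */
static uint8_t  fake_port;
static uint16_t rx_bits;
static void port_out(uint8_t v)
{
    if ((v & SCL_MASK) && !(fake_port & SCL_MASK))   /* rising clock edge */
        rx_bits = (uint16_t)((rx_bits << 1) | ((v & SDA_MASK) ? 1 : 0));
    fake_port = v;
}
#define PORT_OUT(v) port_out(v)
#define PORT_IN()   (fake_port)
#endif

static void send_byte(uint8_t b)
{
    /* Read the port once: keep the other pins, clear SDA and SCL. */
    uint8_t bOld = PORT_IN() & (uint8_t)~(SDA_MASK | SCL_MASK);

    for (uint8_t i = 0; i < 8; i++) {
        uint8_t v = bOld;                   /* SDA low, SCL low */
        if (b & 0x80)
            v |= SDA_MASK;                  /* SDA high for a 1 bit */
        PORT_OUT(v);                        /* set data while the clock is low */
        PORT_OUT((uint8_t)(v | SCL_MASK));  /* clock high */
        PORT_OUT(v);                        /* clock low */
        b <<= 1;
    }
    /* ACK slot: SDA held low, one more clock pulse. */
    PORT_OUT(bOld);
    PORT_OUT((uint8_t)(bOld | SCL_MASK));
    PORT_OUT(bOld);
}
```

Every port access inside the loop is a plain write of a precomputed value. The trade-off is that the cached copy goes stale if anything else (an interrupt handler, for example) changes another pin on the same port while a byte is being sent.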
The bOld variable is helpful to simplify the code which toggles the clock bit without disturbing the other GPIO lines controlled by that port. At this point I've gotten above 100 FPS, but there's still more to do. One of the things I've learned from working with data compression is that the shortest code path should be for the most probable symbol. The byte values 0x00 and 0xFF are very common in image data and have a unique property: all of their bits are identical. This is a time-saving property in this case because it means that the SDA line doesn't have to change while the SCL line is toggled. Adding the conditional statement does use a couple of extra clock cycles, but that cost is outweighed by the savings when transmitting these frequently occurring bytes. Here's the inner loop of that byte transmit function with the new check:
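A sketch of how such a check could be added is shown below. As before, this is not the original listing: `send_byte_fast`, the macros, the pin masks and the PC stand-in (which also counts port writes so the saving is visible) are assumptions made for the illustration.

```c
#include <stdint.h>

#define SDA_MASK 0x01   /* assumed: SDA on PB0 */
#define SCL_MASK 0x04   /* assumed: SCL on PB2 */

#ifdef __AVR__
#include <avr/io.h>
#define PORT_OUT(v) (PORTB = (v))
#define PORT_IN()   (PORTB)
#else
/* PC stand-in: fake port, model slave, and a counter of port writes. */
static uint8_t  fake_port;
static uint16_t rx_bits, port_writes;
static void port_out(uint8_t v)
{
    if ((v & SCL_MASK) && !(fake_port & SCL_MASK))   /* rising clock edge */
        rx_bits = (uint16_t)((rx_bits << 1) | ((v & SDA_MASK) ? 1 : 0));
    fake_port = v;
    port_writes++;
}
#define PORT_OUT(v) port_out(v)
#define PORT_IN()   (fake_port)
#endif

static void send_byte_fast(uint8_t b)
{
    uint8_t bOld = PORT_IN() & (uint8_t)~(SDA_MASK | SCL_MASK);

    if (b == 0x00 || b == 0xFF) {
        /* All eight bits are equal: set SDA once, then only toggle SCL. */
        uint8_t v = b ? (uint8_t)(bOld | SDA_MASK) : bOld;
        PORT_OUT(v);
        for (uint8_t i = 0; i < 8; i++) {
            PORT_OUT((uint8_t)(v | SCL_MASK));
            PORT_OUT(v);
        }
    } else {
        /* General case: SDA may change before every clock pulse. */
        for (uint8_t i = 0; i < 8; i++) {
            uint8_t v = (b & 0x80) ? (uint8_t)(bOld | SDA_MASK) : bOld;
            PORT_OUT(v);
            PORT_OUT((uint8_t)(v | SCL_MASK));
            PORT_OUT(v);
            b <<= 1;
        }
    }
    /* ACK slot: SDA held low, one more clock pulse. */
    PORT_OUT(bOld);
    PORT_OUT((uint8_t)(bOld | SCL_MASK));
    PORT_OUT(bOld);
}
```

In this sketch the fast path needs 20 port writes per byte against 27 for the general path, and it also skips the per-bit test and shift.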
With this final change (and brute-force inlining of this code), the screen refresh rate is > 150 FPS on an ATmega32U4 and > 140 FPS on an ATtiny85. I looked at the final output of the compiler and there is still some room for improvement, but only if I write it in AVR assembly language. I may play with that at a future date, but for now, my work is done 😃. You can get the Arduino project code here.