An implementation of the Floyd-Steinberg dithering algorithm for very low-end Android devices with strict memory constraints. The phones couldn't keep all the graphics in the 32-bit 8888 color format, so they had to be scaled down to the 16-bit 1555 format. The graphics didn't look as good as the artists had intended, so some form of dithering had to be applied. The maximum image size for the phone was 1024×512. Floyd-Steinberg is a very well known dithering algorithm, with many variations that try different error-propagation matrices; I implemented the simplest one for speed. These versions are written in Java, as they targeted the Android platform, but a C++ port is coming.
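For context, here is a hedged sketch of what scaling a channel down to the 1555 format means: each 8-bit R, G and B channel keeps only its top 5 bits, and the rounding error is exactly what Floyd-Steinberg spreads to neighboring pixels. The class and method names are mine, not from the original code:

```java
// Hypothetical sketch: quantizing an 8-bit channel to 5 bits (as in the
// 1555 format) and measuring the error that Floyd-Steinberg diffuses.
public class Quantize1555 {
    // Reduce an 8-bit value to 5 bits, then expand back to 8 bits
    // so the quantization error can be measured in the original scale.
    static int quantize5(int c) {
        int q = c >> 3;              // keep the top 5 bits
        return (q << 3) | (q >> 2);  // replicate high bits into the low ones
    }

    public static void main(String[] args) {
        int original = 200;
        int quantized = quantize5(original);
        int error = original - quantized;  // spread to neighbors by the dither
        System.out.println(quantized + " " + error);
    }
}
```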
I tackled the naive approach first: very flexible, very clean, and very slow. The lowest-end phone in our testbed, the Galaxy Ace, had to load ~900 images, dither them and upload them to video memory, so I inlined everything, removing all convenience function calls. This version uses a two-dimensional integer array with the indices laid out as [RGBA component][pixel].
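The naive per-pixel loop can be sketched on a single channel. This is a hypothetical, simplified version of my own; the real code processes R, G and B together with the [RGBA component][pixel] layout:

```java
// Hypothetical single-channel sketch of the naive approach: quantize each
// value to 5 bits, then push the error to the usual Floyd-Steinberg
// neighbors, guarding every write with an if.
public class NaiveDither {
    static int[] dither(int[] pixels, int imgW, int imgH) {
        int[] data = pixels.clone();
        for (int y = 0; y < imgH; ++y) {
            for (int x = 0; x < imgW; ++x) {
                int i = y * imgW + x;
                int old = data[i];
                int quantized = (old >> 3) << 3;  // drop the low 3 bits
                data[i] = quantized;
                int err = old - quantized;
                // One conditional per neighbor keeps writes inside the image
                if (x + 1 < imgW)                 data[i + 1]        += err * 7 / 16;
                if (y + 1 < imgH && x > 0)        data[i + imgW - 1] += err * 3 / 16;
                if (y + 1 < imgH)                 data[i + imgW]     += err * 5 / 16;
                if (y + 1 < imgH && x + 1 < imgW) data[i + imgW + 1] += err / 16;
            }
        }
        return data;
    }

    public static void main(String[] args) {
        int[] out = dither(new int[]{10, 20, 30, 40}, 2, 2);
        System.out.println(out[0] + " " + out[1] + " " + out[2] + " " + out[3]);
    }
}
```

Those four ifs in the inner loop are what the third approach removes.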
For the second approach I tried inverting the indices to [pixel][RGBA component]. Interestingly, with smaller images such as 800×480 this produced a speedup; with bigger resolutions such as 1600×1200, however, performance suffered.
The third approach removes the ifs in the error-propagation step by padding the destination array with one extra row at the bottom and one extra column on each side. The error is written into the padding and simply discarded later, mitigating the cost of the conditionals.
```java
// Disperse the error using the Floyd-Steinberg matrix
// Notice how up to 4 ifs can be removed
if (x + 1 < imgW) {
    imgData[0][i + 1] += (redError   * 7 / 16);
    imgData[1][i + 1] += (greenError * 7 / 16);
    imgData[2][i + 1] += (blueError  * 7 / 16);
}
if (y + 1 < imgH) {
    if (x > 0) {
        imgData[0][i + imgW - 1] += (redError   * 3 / 16);
        imgData[1][i + imgW - 1] += (greenError * 3 / 16);
        imgData[2][i + imgW - 1] += (blueError  * 3 / 16);
    }
    imgData[0][i + imgW] += (redError   * 5 / 16);
    imgData[1][i + imgW] += (greenError * 5 / 16);
    imgData[2][i + imgW] += (blueError  * 5 / 16);
    if (x + 1 < imgW) {
        imgData[0][i + imgW + 1] += (redError   / 16);
        imgData[1][i + imgW + 1] += (greenError / 16);
        imgData[2][i + imgW + 1] += (blueError  / 16);
    }
}
```
The last approach uses scanline alternation as a memory and performance optimization. Instead of allocating a very big scratch array, it allocates two lines (plus the padding) and alternates between the current and next line. After each line is processed, the result is written into the final buffer. There is a small performance gain and an obvious, large memory improvement.
```java
int[] imgData = new int[4 * NUM_SCANLINES * (m_textureWidth + H_PADDING)];
...
for (int y = 0; y < m_textureHeight - 1; ++y) {
    int curLine  = y % NUM_SCANLINES;
    int nextLine = (y + 1) % NUM_SCANLINES;
    int lCurPad  = (m_textureWidth + H_PADDING) * curLine;  // Start of current line (wrapped) in padded array
    int lNextPad = (m_textureWidth + H_PADDING) * nextLine; // Start of next line (wrapped) in padded array
    ...
```
Because OpenGL requires power-of-two (POT) textures, some textures end up being quite big. The biggest textures in the program were 512×512 and 1024×512.
For a 512×512 image,
| Approach   | Time  |
|------------|-------|
| Old        | 47 ms |
| Inverted   | 40 ms |
| If removal | 33 ms |
| Scanline   | 30 ms |
For a 1024×512 image,
| Approach   | Time  |
|------------|-------|
| Old        | 54 ms |
| Inverted   | 67 ms |
| If removal | 38 ms |
| Scanline   | 35 ms |
For a 1920×1200 image,
| Approach   | Time   |
|------------|--------|
| Old        | 104 ms |
| Inverted   | 107 ms |
| If removal | 74 ms  |
| Scanline   | 70 ms  |
At 2560×1600,
| Approach   | Time   |
|------------|--------|
| Old        | 145 ms |
| Inverted   | 153 ms |
| If removal | 111 ms |
| Scanline   | 99 ms  |
Note how, between the first two approaches, the winner flips as images grow: fiddling with the order of the array indices helps at small sizes and hurts at large ones, possibly related to how the Java VM handles nested arrays internally. The last two approaches use a simple 1D array to avoid these discrepancies. The algorithm becomes a little more involved, but performance becomes more predictable.
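As a hypothetical illustration of that padded 1D layout (the `index` helper and the pad width of 2 are my assumptions; only `H_PADDING` appears in the original code): one pad column per side means the x−1 and x+1 error writes land inside the row even at the image edges, so no bounds check is needed.

```java
// Hypothetical sketch of the padded 1D indexing used by the last two
// approaches: extra pad cells absorb out-of-bounds error writes.
public class PaddedIndex {
    static final int H_PADDING = 2;  // assumed: one pad column per side

    // Map image coordinates to the padded 1D array; +1 skips the left pad cell
    static int index(int x, int y, int imgW) {
        return y * (imgW + H_PADDING) + x + 1;
    }

    public static void main(String[] args) {
        int imgW = 4;
        // At the right edge (x = 3), the x+1 error write hits the right pad
        // cell, which is still inside row 0 of the padded array (cells 0..5).
        System.out.println(index(3, 0, imgW) + 1);
    }
}
```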
All in all, the last algorithm performed ~30% faster than the first one while consuming a negligible amount of scratch memory: on the order of 2 × (imageWidth + padding) elements instead of imageWidth × imageHeight.
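A quick sanity check of that memory claim for the 1024×512 worst case. Sizes are in ints, and the constants (a pad of 2 columns, 2 scanlines, 4 components) are my reading of the snippets above, not measured values:

```java
// Back-of-the-envelope comparison of the whole-image scratch array
// against the two-scanline buffer, in ints.
public class MemoryFootprint {
    // Full scratch: 4 components for every pixel of the image
    static int fullScratchInts(int w, int h) {
        return 4 * w * h;
    }

    // Scanline buffer: 4 components for two padded rows only
    static int scanlineInts(int w, int pad) {
        return 4 * 2 * (w + pad);
    }

    public static void main(String[] args) {
        System.out.println(fullScratchInts(1024, 512) + " ints vs "
                + scanlineInts(1024, 2) + " ints");
    }
}
```

For 1024×512 that is roughly a 250× reduction in scratch space.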
The Eclipse project and source code can be found here.