Dithering Algorithm

Lamborghini Dithered

Lamborghini Dithered

An implementation of the Floyd-Steinberg dithering algorithm for very low-end Android devices, whose memory requirements were pretty strict. The phone couldn’t handle all the graphics in the 32-bit 8888 color format, and had to be scaled down to the 16-bit 1555 format. The graphics didn’t look as great as the artists had intended, therefore some form of dithering had to be applied. The maximum image size for the phone was 1024×512. The Floyd-Steinberg algorithm is a very well known dithering algorithm, with many variations that try different combinations of the error propagation matrix. I implemented the simplest one for speed. These versions are written in Java, as they targeted the Android platform, but a C++ port is coming.

I tackled the naive approach first, very flexible, very clean, and very slow. The lowest phone in our testbed, the Galaxy Ace, had to load ~900 images, dither them and load them into video memory. Therefore I inlined everything, removing all convenience function calls. This version uses a double integer array with the indices laid out as [RGBA component][pixel].

For the second approach I tried inverting the indices, to something like [pixel][RGBA component]. It is interesting that with smaller images, such as 800×480, there was a speedup. However, with bigger resolutions such as 1600×1200, performance suffers.

The third approach removes the ifs in the error propagation matrix by padding the destination array with one extra line at the bottom and the sides. The error is written into the matrix and simply discarded later, therefore mitigating the conditional performance penalty.

The last approach uses scanline alternation for a memory and performance optimization. Instead of loading a very big scratch array, it allocates two lines (plus the padding) and alternates between the current and next line. After each line processed, the result is written into the final buffer. There is a small performance advantage, and a very obvious big memory improvement.

Because OpenGL requires POT textures, some textures end up being pretty big. The biggest textures in the program were 512×512 and 1024×512.

For a 512×512 image,

Approach Time
Old  47 ms
Inverted 40 ms
If removal 33 ms
Scanline 30 ms

For a 1024×512 image,

Approach Time
Old 54 ms
Inverted 67 ms
If removal 38 ms
Scanline 35 ms

For a 1920×1200 image,

 Approach  Time
 Old  104 ms
Inverted 107 ms
If removal 74 ms
Scanline  70 ms

At 2560×1600,

Approach Time
Old 145 ms
Inverted 153 ms
If removal 111 ms
Scanline 99 ms

Note how performance degrades as images grow in size between the first two approaches, when fiddling with the order of the array indices, possibly related to how the Java VM handles them internally. The last two approaches employ a simple 1D array in order to avoid these performance discrepancies. The algorithm becomes a little more involved, but performance becomes more predictable.

All in all, the last algorithm performed ~30% faster than the first one, consuming a negligible amount of memory (2 * (imageWidth + padding) as opposed to imageWidth * imageHeight).

The Eclipse project and source code can be found here.

Comments are closed.