C++ – The Code Corsair

The Art Of Packing Data

By redorav June 16, 2025 August 11, 2025 C++, Graphics, HLSL

Packing data is good practice for many reasons, including disk space and efficient RAM or cache access. If we know the meaning of the data we can often narrow down its range and precision, and make informed decisions about the number of bytes we need. I was once inspired by this article and here’s my take on the topic. We’ll explore ways of packing certain kinds of data common in videogames, along with their possible implementation and rationale; note that this is not an article about compression. I’ll be using HLSL syntax, but it will look very familiar to C++ programmers and can be ported easily to any other language.

Normalized Data

This is the simplest type of data to pack so we’ll start here. Normalized data ranges from 0 to 1. You can easily normalize bounded data by subtracting its minimum value and dividing by its range (maximum minus minimum). This mostly applies to colors or bounded values (think a shadow or transparency term) and sometimes normalized vectors, although there are better methods as we’ll see later. The D3D12 formats for this kind of data are the _UNORM class, such as R8G8B8A8_UNORM or R16G16_UNORM. The code examples below show how to encode a typical color with alpha using 8 bits per channel, and a two-component value using 16 bits per channel. They are common cases, but you can make as many variations as needed depending on the bitrate and the data you want to store.

// Packs four normalized floats into 8 bits each, red in the lowest byte
uint PackFloat4ToRGBA8Unorm(float4 value)
{
    uint4 uvalue = uint4(value * 255.0 + 0.5);
    return (uvalue.a << 24) | (uvalue.b << 16) | (uvalue.g << 8) | uvalue.r;
}

// Extracts each byte and rescales it back to the 0 to 1 range
float4 UnpackRGBA8UnormToFloat4(uint packed)
{
    uint ri = packed & 0xff;
    uint gi = (packed >> 8) & 0xff;
    uint bi = (packed >> 16) & 0xff;
    uint ai = packed >> 24;
    return float4(ri, gi, bi, ai) / 255.0;
}

// Packs two normalized floats into 16 bits each, red in the lowest half
uint PackFloat2ToRG16Unorm(float2 value)
{
    uint2 uvalue = uint2(value * 65535.0 + 0.5);
    return (uvalue.g << 16) | uvalue.r;
}

// Extracts each 16-bit half and rescales it back to the 0 to 1 range
float2 UnpackRG16UnormToFloat2(uint packed)
{
    uint ri = packed & 0xffff;
    uint gi = packed >> 16;
    return float2(ri, gi) / 65535.0;
}
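As a CPU-side sanity check of the RGBA8 pair above, here is a rough C++ port. The `Float4` struct and the `Cpu`-suffixed function names are hypothetical stand-ins I made up for this sketch, and the clamp is an extra safety step that the HLSL version does not perform (it assumes the input is already in the 0 to 1 range).

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for HLSL's float4
struct Float4 { float r, g, b, a; };

// Quantize one normalized channel to 8 bits: scale, add 0.5, truncate.
// The clamp is an addition for safety; the HLSL code above does not clamp.
static uint32_t QuantizeUnorm8(float v)
{
    v = std::min(std::max(v, 0.0f), 1.0f);
    return static_cast<uint32_t>(v * 255.0f + 0.5f);
}

// Same bit layout as the HLSL version: red in the lowest byte, alpha in the highest
uint32_t PackFloat4ToRGBA8UnormCpu(Float4 value)
{
    return (QuantizeUnorm8(value.a) << 24) | (QuantizeUnorm8(value.b) << 16) |
           (QuantizeUnorm8(value.g) << 8)  |  QuantizeUnorm8(value.r);
}

Float4 UnpackRGBA8UnormToFloat4Cpu(uint32_t packed)
{
    return { (packed & 0xff) / 255.0f,
             ((packed >> 8) & 0xff) / 255.0f,
             ((packed >> 16) & 0xff) / 255.0f,
             (packed >> 24) / 255.0f };
}
```

With this layout, opaque red `{1, 0, 0, 1}` packs to `0xff0000ff`, and unpacking any 8-bit value and packing it again returns the same bits, which is a handy property to assert in unit tests.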

Note how we add 0.5 to the result after multiplying by 255. For non-negative values, this operation followed by the truncating cast is equivalent to rounding to nearest, but it avoids the round instruction since the add gets folded into the multiply-add. Some of these operations are so common that many closed platforms have intrinsics or special instructions to encode and decode bits. Recently, HLSL added some special packing intrinsics in Shader Model 6.6, so we can also write the RGBA8 packing as follows.

uint PackFloat4ToRGBA8Unorm(float4 value)
{
    uint4 uvalue = uint4(value * 255.0 + 0.5);
    return pack_u8(uvalue);
}

float4 UnpackRGBA8UnormToFloat4(uint packed)
{
    return float4(unpack_u8u32(packed)) / 255.0;
}

We’ll stop here for a minute to analyze the RDNA bytecode generated from these instructions. I have grouped the instructions to make logical sense, as the compiler is free to reorder them. These tests were performed on the Radeon Graphics Analyzer (RGA) using the gfx1103 RDNA3 ASIC in offline mode. We need to be careful, as older RGA versions produce worse code than the baseline, whereas the latest one I used here shows an improvement. As always, measure and make sure! The command line I used, should you wish to replicate the results, is:

    .\rga.exe -s dx12 -c gfx1103 --offline --cs example.hlsl --cs-entry CSMain --cs-model cs_6_6 --dxc-opt --isa example_hlsl_v1.txt

// Manual shifts and ORs
uint PackFloat4ToRGBA8Unorm(float4 value)
{
    uint4 uvalue = uint4(value * 255.0 + 0.5);
    return (uvalue.a << 24) | (uvalue.b << 16) | (uvalue.g << 8) | uvalue.r;
}

// pack_u8 intrinsic
uint PackFloat4ToRGBA8Unorm(float4 value)
{
    uint4 uvalue = uint4(value * 255.0 + 0.5);
    return pack_u8(uvalue);
}

// ISA for the manual shift-and-OR version

    // MAD 255.0 + 0.5 (4 instructions)
    v_fma_f32 v4, s6, lit(0x437f0000), 0.5
    v_fma_f32 v1, s4, lit(0x437f0000), 0.5
    v_fma_f32 v3, s7, lit(0x437f0000), 0.5
    v_fma_f32 v2, s5, lit(0x437f0000), 0.5

    // Convert to integer (4 instructions)
    v_cvt_i32_f32 v4, v4
    v_cvt_i32_f32 v1, v1
    v_cvt_i32_f32 v3, v3
    v_cvt_i32_f32 v2, v2

    // Shift left (3 instructions)
    v_lshlrev_b32 v3, 24, v3
    v_lshlrev_b32 v2, 8, v2
    s_lshl_b32 s4, s4, 16

    // OR them all together (3 instructions)
    v_or_b32 v1, s4, v1
    v_or_b32 v1, v3, v1
    v_or_b32 v2, v2, v1

Total: 14 instructions

// ISA for the pack_u8 version

    // MAD 255.0 + 0.5 (4 instructions)
    v_fma_f32 v0, s4, lit(0x437f0000), 0.5
    v_fma_f32 v1, s5, lit(0x437f0000), 0.5
    v_fma_f32 v2, s6, lit(0x437f0000), 0.5
    v_fma_f32 v3, s7, lit(0x437f0000), 0.5

    // Convert to integer (4 instructions)
    v_cvt_i32_f32 v0, v0
    v_cvt_i32_f32 v2, v2
    v_cvt_i32_f32 v3, v3
    v_cvt_i32_f32 v1, v1

    // Permute and OR (3 instructions)
    v_perm_b32 v2, v2, v3, lit(0x00040c0c)
    v_perm_b32 v0, v0, v1, lit(0x0c0c0004)
    v_or_b32 v0, v2, v0

Total: 11 instructions
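To make the two v_perm_b32 lines easier to read, here is a simplified C++ model of a byte permute. It is my reading of what the listing does, not a full description of the instruction: I'm assuming the two sources form an 8-byte pool (bytes 0 to 3 from the second source, bytes 4 to 7 from the first), that each selector byte picks one pool byte, and that the selector value 0x0c writes a zero. Any other selector behaviour is left out, and the function names are made up for the sketch.

```cpp
#include <cstdint>

// Simplified byte-permute model (assumption, see text): only selectors 0..7
// and 0x0c are handled; anything else is treated as zero here.
uint32_t PermuteBytes(uint32_t src0, uint32_t src1, uint32_t selector)
{
    // Pool bytes 0..3 come from src1, pool bytes 4..7 come from src0
    const uint64_t pool = (static_cast<uint64_t>(src0) << 32) | src1;
    uint32_t result = 0;
    for (int i = 0; i < 4; ++i)
    {
        const uint32_t sel = (selector >> (8 * i)) & 0xff;
        uint32_t byte = 0; // 0x0c (and unmodeled selectors) produce zero
        if (sel < 8)
        {
            byte = static_cast<uint32_t>((pool >> (8 * sel)) & 0xff);
        }
        result |= byte << (8 * i);
    }
    return result;
}

// Mirrors the last three instructions of the pack_u8 listing
uint32_t PackWithPermutes(uint32_t r, uint32_t g, uint32_t b, uint32_t a)
{
    const uint32_t high = PermuteBytes(b, a, 0x00040c0c); // a to byte 3, b to byte 2
    const uint32_t low  = PermuteBytes(r, g, 0x0c0c0004); // g to byte 1, r to byte 0
    return high | low;
}
```

Under these assumptions each permute places two channels at once, which is why the three shifts and three ORs of the manual version collapse into two permutes and one OR.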

As you can see, the compiler is able to improve on our hand-written logic and save three instructions (11 instead of 14) by using v_perm_b32, an instruction that permutes bytes from its source registers into a single result. Unfortunately, we don’t have high-level instructions to perform the same operation manually. There are other normalized formats commonly used in videogames that don’t have the same bit width for all components, for example the R5G6B5, R5G5B5A1 or R10G10B10A2 formats. We can see how to encode and decode one of them below.

// 10 bits each for red, green and blue (max value 1023), 2 bits for alpha (max value 3)
uint PackFloat4ToRGB10A2Unorm(float4 value)
{
    uint3 rgbi = uint3(value.rgb * 1023.0 + 0.5);
    uint ai = uint(value.a * 3.0 + 0.5);
    return (ai << 30) | (rgbi.b << 20) | (rgbi.g << 10) | rgbi.r;
}

float4 UnpackRGB10A2UnormToFloat4(uint packed)
{
    uint ri = packed & 0x3ff;
    uint gi = (packed >> 10) & 0x3ff;
    uint bi = (packed >> 20) & 0x3ff;
    uint ai = packed >> 30;
    return float4(float3(ri, gi, bi) / 1023.0, ai / 3.0);
}
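The same pattern carries over to the other uneven formats mentioned above. As a sketch, here is what R5G6B5 might look like in C++. The `Float3` struct and function names are hypothetical, the input is assumed to already be in the 0 to 1 range, and I'm keeping this article's convention of putting the first channel in the lowest bits, which may not match the bit order a particular API format expects.

```cpp
#include <cstdint>

// Hypothetical stand-in for HLSL's float3
struct Float3 { float r, g, b; };

// 5 bits for red (max 31), 6 bits for green (max 63), 5 bits for blue (max 31).
// Assumes each channel is already in the 0 to 1 range.
uint16_t PackFloat3ToR5G6B5Unorm(Float3 value)
{
    const uint32_t ri = static_cast<uint32_t>(value.r * 31.0f + 0.5f);
    const uint32_t gi = static_cast<uint32_t>(value.g * 63.0f + 0.5f);
    const uint32_t bi = static_cast<uint32_t>(value.b * 31.0f + 0.5f);
    return static_cast<uint16_t>((bi << 11) | (gi << 5) | ri);
}

Float3 UnpackR5G6B5UnormToFloat3(uint16_t packed)
{
    const uint32_t ri = packed & 0x1f;
    const uint32_t gi = (packed >> 5) & 0x3f;
    const uint32_t bi = (packed >> 11) & 0x1f;
    return { ri / 31.0f, gi / 63.0f, bi / 31.0f };
}
```

Green gets the extra bit, so each channel has its own maximum value when scaling; mixing them up (for example dividing green by 31) is the usual mistake with these formats.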


Life and Death of a Graphics Programmer

By redorav May 16, 2024 June 13, 2025 C++, Graphics

Recurrent internet discussions show a divide between programmers working in different industries. Topics like code clarity, performance, debuggability, architecture or maintainability are a source of friction. We are, paraphrasing the quote, industries divided by a common language. I am curious about other programmers’ experiences, and I wanted to present a general view of mine as a graphics programmer in games, in the form of anecdotes and examples. It’s not meant to be a rant or exhaustive, rather a description of common problems, pitfalls and personal experience sprinkled in. The target audience is either videogame developers who want to nod throughout or developers writing very different software who are curious about what we do. It focuses on C++ and shader languages because that’s mostly what we use.

Hard Requirements

Videogames cram very demanding processing into modest mainstream hardware (consoles, mobile) and attempt to run fast and consistently: a combination of I/O, network, audio, physics, pathfinding, low-latency input, gameplay, and displaying images on screen, all in a handful of milliseconds. Similarly, systems like embedded hardware applications (cars, space, low-latency trading) are also very constrained but operate in a very specialized domain. On another part of the software spectrum we find UI-centric programs such as word processors, browsers or management software, which are more event-driven and tolerant of a bit more latency.

There are also requirements games don’t have. Most don’t have stringent security concerns like OSs, transportation or banking do (except online games or competitive e-sports). Game-breaking bugs aren’t life-threatening. High-frequency trading or automotive image processing applications have very strict correctness requirements, whereas players are mostly tolerant of some glitches as long as they’re having fun. Games don’t distribute their source code or interface with the world’s code, so certain API restrictions don’t exist, e.g. we don’t build DLLs or provide SDKs. Some code is specific to a release, so there’s a subset that can be hacked together right before shipping.

With that in mind, videogames care about performance in many more areas than other software does: not just runtime performance but also the performance of the tools. Performance becomes part of system correctness. As examples, all of these situations from different domains are wrong:

Audio lags behind the image, or image lags behind the audio in a cutscene
Networking is too slow in an online game and the game pauses frequently
Streaming is too slow and the game stutters as you traverse the world
The response lags behind the input and causes a lack of control

I once saw a cutscene system where the audio is not synced to the video/animation; instead the video tracks the audio, which avoids the typical audio drift and gives more consistent synchronization between them. Humor and fast action are the essence of those cutscenes, and that’s a creative way to make sure the comedy lands correctly.
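I don't have that system's code, but the idea of letting the video follow the audio can be sketched in a few lines of C++. Everything here is hypothetical (the `AudioClock` struct, its fields and `CutsceneFrameAt`): the only point is that the animation time is derived from how many audio samples have actually been played, rather than from an independently accumulated frame delta that can drift.

```cpp
#include <cstdint>

// Hypothetical audio clock: the audio device reports how many samples it has played
struct AudioClock
{
    uint64_t samplesPlayed = 0;
    uint32_t sampleRate = 48000;

    double Seconds() const
    {
        return static_cast<double>(samplesPlayed) / sampleRate;
    }
};

// The animation frame is looked up from the audio time, so the picture
// can never run ahead of or behind what the player is hearing.
uint32_t CutsceneFrameAt(const AudioClock& clock, double animationFps)
{
    return static_cast<uint32_t>(clock.Seconds() * animationFps);
}
```

If the game hitches for a few frames, the audio keeps playing and the next rendered frame simply jumps to wherever the audio is, instead of the two slowly separating.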

Waiting for Mr Compiler

I spend an inordinate amount of time waiting for the computer to do things I need in order to work. Sometimes it’s loading, sometimes processing assets, but most of the time it’s compiling, both C++ code and shaders. Every company I’ve worked for used C++ for the engine and HLSL for shaders. Long compile times are not unique to games, but they are the reality in every large codebase I’ve worked on: a frustrating, soulless ritual necessary to get your code from doing A to doing B. It distracts from doing meaningful work and breaks concentration. It is the very opposite of fast iteration. Let’s just state some bullet points from my experience:

A full rebuild of “the engine” can take anywhere from 10 to 40 minutes. I know of smaller codebases where it’s faster, and there’s definitely worse (e.g. Unreal Engine)
A full rebuild of “the shaders” can also take a really long time, depending on how your shader setup works
An incremental build for a single file change can take anywhere from seconds to a full rebuild’s worth of time, depending on whether you touched a header included everywhere or a cpp with no dependencies
Many shops use Incredibuild to speed up compilation. Even that is often not enough
Code lives on SSD/NVMe drives now, which means I/O is rarely the issue (compiling through the network does reintroduce the problem)
Parallel compilation is standard these days, so all cores are engaged in this process
Linking is normally single threaded and can take very long
Throwing more hardware at the problem mitigates it briefly until your codebase inflates again
Some codebases use PCHs and others Unity builds. Both are improvements but also manual and difficult to maintain
We compile for many platforms. As a rather extreme example, some LEGO games shipped for 7 platforms simultaneously
Every platform’s tooling is different. You might find that compiling for platform X is much slower than for platform Y

A big part of this problem stems from C’s inclusion model, the ancient and for decades refined scribal technique of copy-pasting code. I’ll never understand why C++ didn’t evolve something akin to modules decades earlier, and instead spends time developing library add-ons that bring anecdotal value and further slowdowns. C++ takes pride in the ‘zero-cost abstraction’ model, but that simply does not apply to compile times. Any time you include a header file in a compilation unit, you pay a non-negligible cost even if you don’t use anything from it: many standard library headers take hundreds of milliseconds to compile. If you have thousands of cpps including them, this adds up enormously. C++20 modules are making their way into compilers, but large codebases are going to have a hard time migrating.

There is a constant tension between convenience and compile times. I worked on a codebase where all rendering headers were put inside “render_api.h” and code from other teams included it. It was very simple to set up, but any time I touched a rendering header, it recompiled the entire codebase due to transitive inclusion. Breaking the header apart took a long time, whereas putting it together in the first place took no effort. Small actions can have large consequences, and the language has not provided a solution for decades.
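One common way to break that kind of transitive inclusion is the forward declaration: code that only stores a pointer or reference to a type doesn't need its full definition. The sketch below squeezes what would be several files into one so it compiles on its own; the file names in the comments and the `Texture` and `Decal` types are hypothetical and not from the codebase I mentioned.

```cpp
// --- render_fwd.h (hypothetical): cheap to include, rarely changes ---
class Texture; // forward declaration, the full definition is not needed yet

// --- decal.h (hypothetical gameplay header): only stores a pointer,
// --- so it includes render_fwd.h and never the heavy texture.h
struct Decal
{
    const Texture* albedo = nullptr;
    float size = 1.0f;
};

// --- texture.h (hypothetical): the full definition, included only by
// --- the few .cpp files that actually call into Texture
class Texture
{
public:
    explicit Texture(int width) : m_width(width) {}
    int Width() const { return m_width; }

private:
    int m_width;
};

// --- decal.cpp (hypothetical): the one place that needs the complete type
int DecalTextureWidth(const Decal& decal)
{
    return decal.albedo ? decal.albedo->Width() : 0;
}
```

With this split, editing the body of `Texture` only recompiles the files that include texture.h, not everything that happens to hold a `Decal`.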


Rendering Line Lights

By redorav July 10, 2019 December 4, 2024 C++, Graphics

Within the arsenal of lights provided by game engines, the most popular are punctual lights such as point, spot or directional because they are cheap. On the other end, area lights have recently produced incredible techniques such as Linearly Transformed Cosines and other analytic approximations. I want to talk about the line light.

Update [04/09/2020] When I originally wrote the article there were no public images showing Jedi or lightsabers so I couldn’t make the connection (though a clever reader could have concluded what they might be for!) I can finally show this work off as it’s meant to be. You can also watch a gameplay trailer here.

In Unreal Engine 4, modifying ‘Source Length’ on a point light elongates it as described in this paper. It spreads the intensity along the length, so a longer light becomes perceptually dimmer. Frostbite also has tube lights, a complex implementation of the analytical illuminance emitted by a cylinder and two spheres. Unity includes tube lights as well in their HD Render Pipeline (thanks Eric Heitz and Evgenii Golubev for pointing it out) based on their LTC theory, for which you can find a great explanation and demos here. Guerrilla Games’ Decima Engine has elongated quad lights using an approach for which they have a very attractive and thorough explanation in GPU Pro 5’s chapter II.1, Physically Based Area Lights. This is what I adapted to line lights.
