June 2025 – The Code Corsair

Packing data is good practice for many reasons, including saving disk space and making RAM and cache access more efficient. If we know what the data means, we can often narrow down its range and precision and make an informed decision about how many bytes we need. I was once inspired by this article, and here’s my take on the topic. We’ll explore ways of packing kinds of data that are common in videogames, along with possible implementations and their rationale. Note that this is not an article about compression. I’ll be using HLSL syntax, which will look very familiar to C++ programmers and can be ported easily to other languages.

Normalized Data

This is the simplest type of data to pack, so we’ll start here. Normalized data ranges from 0 to 1. You can easily normalize a bounded value by subtracting its minimum and dividing by its range (maximum minus minimum). This mostly applies to colors or bounded values (think of a shadow or transparency term) and sometimes to normalized vectors, although there are better methods for those, as we’ll see later. The D3D12 formats for this kind of data are the _UNORM class, such as R8G8B8A8_UNORM or R16G16_UNORM. The code examples below show how to encode a typical color with alpha using 8 bits per channel and a two-component value using 16 bits per channel. These are common cases, but you can make as many variations as needed depending on the bit rate and the data you want to store.

uint PackFloat4ToRGBA8Unorm(float4 value)
{
    uint4 uvalue = uint4(value * 255.0 + 0.5);
    return (uvalue.a << 24) | (uvalue.b << 16) | (uvalue.g << 8) | uvalue.r;
}

float4 UnpackRGBA8UnormToFloat4(uint packed)
{
    uint ri = packed & 0xff;
    uint gi = (packed >> 8) & 0xff;
    uint bi = (packed >> 16) & 0xff;
    uint ai = packed >> 24;
    return float4(ri, gi, bi, ai) / 255.0;
}

uint PackFloat2ToRG16Unorm(float2 value)
{
    uint2 uvalue = uint2(value * 65535.0 + 0.5);
    return (uvalue.g << 16) | uvalue.r;
}

float2 UnpackRG16UnormToFloat2(uint packed)
{
    uint ri = packed & 0xffff;
    uint gi = packed >> 16;
    return float2(ri, gi) / 65535.0;
}
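The remark about normalizing a bounded value before packing it can be sketched on the CPU. What follows is a rough C++ port rather than shader code, and NormalizeRange, DenormalizeRange, PackRG16Unorm and UnpackRG16Unorm are hypothetical names chosen for this illustration. It assumes the inputs already lie inside the stated range.

```cpp
#include <cstdint>

// Hypothetical helper: map a value in [minValue, maxValue] to [0, 1].
float NormalizeRange(float x, float minValue, float maxValue)
{
    return (x - minValue) / (maxValue - minValue);
}

// Hypothetical helper: map a normalized value back to [minValue, maxValue].
float DenormalizeRange(float n, float minValue, float maxValue)
{
    return n * (maxValue - minValue) + minValue;
}

// CPU-side port of the 16-bit packing: x lands in the low 16 bits, y in the high 16 bits.
uint32_t PackRG16Unorm(float x, float y)
{
    uint32_t xi = static_cast<uint32_t>(x * 65535.0f + 0.5f);
    uint32_t yi = static_cast<uint32_t>(y * 65535.0f + 0.5f);
    return (yi << 16) | xi;
}

// Inverse of PackRG16Unorm: extract both 16-bit fields and rescale to [0, 1].
void UnpackRG16Unorm(uint32_t packed, float& x, float& y)
{
    x = static_cast<float>(packed & 0xffffu) / 65535.0f;
    y = static_cast<float>(packed >> 16) / 65535.0f;
}
```

For example, a value of 20 in the range [-20, 60] normalizes to 0.5, and after a pack and unpack round trip it comes back to within half a quantization step of the original, roughly 0.0006 for this range.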

Note how we add 0.5 to the result after multiplying by 255 (or 65535). For non-negative values, this operation followed by a truncating cast is equivalent to rounding to nearest, but it avoids a separate round instruction because the add is folded into the multiply-add. Some of these operations are so common that many closed platforms have intrinsics or special instructions to encode and decode bits. Recently, HLSL added packing intrinsics in Shader Model 6.6, so we can also write the RGBA8 packing as follows.

uint PackFloat4ToRGBA8Unorm(float4 value)
{
    uint4 uvalue = uint4(value * 255.0 + 0.5);
    return pack_u8(uvalue);
}

float4 UnpackRGBA8UnormToFloat4(uint packed)
{
    return float4(unpack_u8u32(packed)) / 255.0;
}
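One detail worth checking is what happens when an input falls outside [0, 1], since neither version above saturates. The C++ sketch below models this on the CPU. It assumes, based on my reading of the Shader Model 6.6 documentation, that pack_u8 keeps only the low 8 bits of each lane and that pack_clamp_u8 clamps each signed lane to [0, 255]. All four function names are hypothetical stand-ins, not real APIs.

```cpp
#include <cstdint>

// Simplified CPU model of pack_u8 (assumption): keep the low 8 bits of each lane, no clamping.
uint32_t PackU8Model(uint32_t r, uint32_t g, uint32_t b, uint32_t a)
{
    return ((a & 0xffu) << 24) | ((b & 0xffu) << 16) | ((g & 0xffu) << 8) | (r & 0xffu);
}

// Simplified CPU model of pack_clamp_u8 (assumption): clamp each signed lane to [0, 255] first.
uint32_t PackClampU8Model(int32_t r, int32_t g, int32_t b, int32_t a)
{
    auto clamp8 = [](int32_t v) -> uint32_t {
        return static_cast<uint32_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
    };
    return (clamp8(a) << 24) | (clamp8(b) << 16) | (clamp8(g) << 8) | clamp8(r);
}

// The shift-and-OR packing without masks, for comparison.
uint32_t PackShiftOr(uint32_t r, uint32_t g, uint32_t b, uint32_t a)
{
    return (a << 24) | (b << 16) | (g << 8) | r;
}

// Quantize one channel with the multiply-add-and-truncate rounding used in the article.
uint32_t QuantizeUnorm8(float v)
{
    return static_cast<uint32_t>(v * 255.0f + 0.5f);
}
```

With an input of 1.5 the quantized value is 383. The unmasked shift-and-OR version leaks bit 8 into the neighbouring channel, the pack_u8 model wraps to 0x7f, and the clamped model gives 0xff. If out-of-range inputs are possible, calling saturate() on the value before scaling avoids all three surprises.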

We’ll stop here for a minute to analyze the RDNA bytecode generated from these two versions. I have grouped the instructions so that they make logical sense, as the compiler is free to reorder them. These tests were performed with the Radeon GPU Analyzer (RGA), targeting the gfx1103 RDNA3 ASIC in offline mode. We need to be careful, as older RGA versions produce worse code than the baseline, whereas the latest one I used here shows an improvement. As always, measure and make sure! Should you wish to replicate the results, the command line I used is:

.\rga.exe -s dx12 -c gfx1103 --offline --cs example.hlsl --cs-entry CSMain --cs-model cs_6_6 --dxc-opt --isa example_hlsl_v1.txt

// Version 1: manual shifts and ORs
uint PackFloat4ToRGBA8Unorm(float4 value)
{
    uint4 uvalue = uint4(value * 255.0 + 0.5);
    return (uvalue.a << 24) | (uvalue.b << 16) | (uvalue.g << 8) | uvalue.r;
}

// Version 2: pack_u8 intrinsic
uint PackFloat4ToRGBA8Unorm(float4 value)
{
    uint4 uvalue = uint4(value * 255.0 + 0.5);
    return pack_u8(uvalue);
}

// ISA for the manual shift-and-OR version

// MAD 255.0 + 0.5 (4 instructions)
v_fma_f32 v4, s6, lit(0x437f0000), 0.5
v_fma_f32 v1, s4, lit(0x437f0000), 0.5
v_fma_f32 v3, s7, lit(0x437f0000), 0.5
v_fma_f32 v2, s5, lit(0x437f0000), 0.5

// Convert to integer (4 instructions)
v_cvt_i32_f32 v4, v4
v_cvt_i32_f32 v1, v1
v_cvt_i32_f32 v3, v3
v_cvt_i32_f32 v2, v2

// Shift left (3 instructions)
v_lshlrev_b32 v3, 24, v3
v_lshlrev_b32 v2, 8, v2
s_lshl_b32 s4, s4, 16

// OR them all together (3 instructions)
v_or_b32 v1, s4, v1
v_or_b32 v1, v3, v1
v_or_b32 v2, v2, v1

Total: 14 instructions

// ISA for the pack_u8 version

// MAD 255.0 + 0.5 (4 instructions)
v_fma_f32 v0, s4, lit(0x437f0000), 0.5
v_fma_f32 v1, s5, lit(0x437f0000), 0.5
v_fma_f32 v2, s6, lit(0x437f0000), 0.5
v_fma_f32 v3, s7, lit(0x437f0000), 0.5

// Convert to integer (4 instructions)
v_cvt_i32_f32 v0, v0
v_cvt_i32_f32 v2, v2
v_cvt_i32_f32 v3, v3
v_cvt_i32_f32 v1, v1

// Permute and OR (3 instructions)
v_perm_b32 v2, v2, v3, lit(0x00040c0c)
v_perm_b32 v0, v0, v1, lit(0x0c0c0004)
v_or_b32 v0, v2, v0

Total: 11 instructions
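To make the two v_perm_b32 selectors in the second listing easier to follow, here is a simplified CPU model in C++. It rests on my understanding of the instruction: each selector byte chooses one output byte, values 0 to 3 take a byte from the second source, values 4 to 7 take a byte from the first source, and 0x0c produces zero. Other selector values are not modeled, and PermB32Model and PackWithPerm are hypothetical names, so treat this as a sketch rather than a reference.

```cpp
#include <cstdint>

// Simplified model of v_perm_b32 (assumption): selectors 0-3 pick bytes of src1,
// 4-7 pick bytes of src0, and 0x0c yields a zero byte. Other selectors are not modeled.
uint32_t PermB32Model(uint32_t src0, uint32_t src1, uint32_t selector)
{
    uint32_t result = 0;
    for (int i = 0; i < 4; ++i)
    {
        uint32_t sel = (selector >> (8 * i)) & 0xffu;
        uint32_t byteValue = 0; // 0x0c and unmodeled selectors produce zero here
        if (sel < 4)
            byteValue = (src1 >> (8 * sel)) & 0xffu;
        else if (sel < 8)
            byteValue = (src0 >> (8 * (sel - 4))) & 0xffu;
        result |= byteValue << (8 * i);
    }
    return result;
}

// Mirrors the three-instruction tail of the pack_u8 listing.
uint32_t PackWithPerm(uint32_t r, uint32_t g, uint32_t b, uint32_t a)
{
    uint32_t high = PermB32Model(b, a, 0x00040c0cu); // a -> byte 3, b -> byte 2
    uint32_t low  = PermB32Model(r, g, 0x0c0c0004u); // g -> byte 1, r -> byte 0
    return high | low;
}
```

Under this model, each permute gathers two channels into their final byte positions in a single instruction, which is why one OR is enough to finish the job.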

As you can see, with pack_u8 the compiler is able to improve on our hand-written logic and save three instructions by using v_perm_b32, an instruction that selects bytes from two source registers and swizzles them into a single one. Unfortunately, we don’t have high-level instructions to express the same operation manually. There are other normalized formats commonly used in videogames that don’t have the same bit width for all components, for example R5G6B5, R5G5B5A1 or R10G10B10A2. We can see how to encode and decode one of them below.

uint PackFloat4ToRGB10A2Unorm(float4 value)
{
    uint3 rgbi = uint3(value.rgb * 1023.0 + 0.5);
    uint ai = uint(value.a * 3.0 + 0.5);
    return (ai << 30) | (rgbi.b << 20) | (rgbi.g << 10) | rgbi.r;
}

float4 UnpackRGB10A2UnormToFloat4(uint packed)
{
    uint ri = packed & 0x3ff;
    uint gi = (packed >> 10) & 0x3ff;
    uint bi = (packed >> 20) & 0x3ff;
    uint ai = packed >> 30;
    return float4(float3(ri, gi, bi) / 1023.0, ai / 3.0);
}
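The same pattern carries over to the other uneven formats mentioned earlier. As an illustration, here is a CPU-side C++ sketch of R5G6B5 that follows the convention used above of placing the first component in the lowest bits. The Float3 struct and both function names are hypothetical, and inputs are assumed to be in [0, 1].

```cpp
#include <cstdint>

// Minimal stand-in for an HLSL float3 (hypothetical, for this sketch only).
struct Float3 { float r, g, b; };

// 5 bits of red in bits 0-4, 6 bits of green in bits 5-10, 5 bits of blue in bits 11-15.
uint16_t PackFloat3ToR5G6B5Unorm(Float3 v)
{
    uint32_t ri = static_cast<uint32_t>(v.r * 31.0f + 0.5f);
    uint32_t gi = static_cast<uint32_t>(v.g * 63.0f + 0.5f);
    uint32_t bi = static_cast<uint32_t>(v.b * 31.0f + 0.5f);
    return static_cast<uint16_t>((bi << 11) | (gi << 5) | ri);
}

// Inverse: mask out each field and divide by its maximum code (31 or 63).
Float3 UnpackR5G6B5UnormToFloat3(uint16_t packed)
{
    Float3 v;
    v.r = static_cast<float>(packed & 0x1fu) / 31.0f;
    v.g = static_cast<float>((packed >> 5) & 0x3fu) / 63.0f;
    v.b = static_cast<float>(packed >> 11) / 31.0f;
    return v;
}
```

Each channel uses its own scale, 2^bits - 1, both when packing and unpacking. A real texture format may order the channels differently, so check the target format before reusing this layout.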

The Art Of Packing Data
