Life and Death of a Graphics Programmer

Recurrent internet discussions show a divide between programmers working in different industries. Topics like code clarity, performance, debuggability, architecture or maintainability are a source of friction. We are, paraphrasing the quote, industries divided by a common language. I am curious about other programmers’ experiences, and I wanted to present a general view of mine as a graphics programmer in games, in the form of anecdotes and examples. It’s not meant to be a rant or an exhaustive list, rather a description of common problems and pitfalls, with personal experience sprinkled in. The target audience is either videogame developers who want to nod throughout or developers writing very different software who are curious about what we do. It focuses on C++ and shader languages because that’s mostly what we use.

Hard Requirements

Videogames cram very demanding processing into modest mainstream hardware (consoles, mobile) while attempting to run fast and consistently: a combination of I/O, networking, audio, physics, pathfinding, low latency input, gameplay, and displaying images on screen in a handful of milliseconds. Systems like embedded hardware applications (cars, space, low latency trading) are similarly constrained but operate in a very specialized domain. On another part of the software spectrum we find UI-centric programs such as word processors, browsers or management software, which are more event-driven and tolerant to a bit more latency.

There are also requirements games don’t have. Most don’t have stringent security concerns like OSs, transportation or banking (except online games or competitive e-sports). Game-breaking bugs aren’t life-threatening. High-frequency trading or automotive image processing applications have very strict correctness requirements, whereas players are mostly tolerant to some glitches as long as they’re having fun. Games don’t distribute their source code or interface with the world’s code so certain API restrictions don’t exist, e.g. we don’t build DLLs or provide SDKs. Some code is specific to a release so there’s a subset that can be hacked together right before shipping.

With that in mind, videogames care about performance in many more areas than other kinds of software, not just at runtime but also in the tools. Performance becomes part of system correctness. Just as examples, all of these situations from different domains are wrong:

  • Audio lags behind the image, or image lags behind the audio in a cutscene
  • Networking is too slow in an online game and the game pauses frequently
  • Streaming is too slow and the game stutters as you traverse
  • Input lags behind the response and causes a loss of control
I once saw a cutscene system where the audio was not synced to the video/animation but instead the video tracked the audio, to avoid the typical audio drift and get more consistent synchronization between them. Humor and fast action are the essence of those cutscenes, and that’s a creative way to make sure the comedy lands correctly.

Waiting for Mr Compiler

I spend an inordinate amount of time waiting for the computer to do things I need to work. Sometimes it’s loading, sometimes processing assets, but most of the time it’s compiling, both C++ code and shaders. Every company I’ve worked for has used C++ for the engine and HLSL for shaders. Compile times are not unique to games, but they are a reality in every large codebase I’ve worked on; a frustrating, soulless ritual necessary to get your code from doing A to doing B. It distracts from doing meaningful work and breaks concentration. It is the very opposite of fast iteration. Let’s just state some bullet points from my experience:

  • A full rebuild of “the engine” can take anywhere from 10 to 40 minutes. I know of smaller codebases where it’s faster, and there’s definitely worse (e.g. Unreal Engine)
  • A full rebuild of “the shaders” can also take a really long time, depending on how your shader setup works
  • An incremental build for a single file change can take anywhere from seconds to a full rebuild’s worth of time, depending on whether you touched a header included everywhere or a cpp with no dependencies
  • Many shops use Incredibuild to speed up compilation. Even that is often not enough
  • Code lives in SSD/NVMe drives now, which means I/O is rarely the issue (compiling through the network does reintroduce the problem)
  • Parallel compilation is standard these days, all cores are engaged in this process
  • Linking is normally single threaded and can take a very long time
  • Throwing more hardware at the problem mitigates it briefly until your codebase inflates again
  • Some codebases use PCHs and others Unity builds. Both are improvements but also manual and difficult to maintain
  • We compile for many platforms. A rather extreme example, some LEGO games shipped for 7 platforms simultaneously
  • Every platform’s tooling is different. You might find that compiling for platform X is much slower than for platform Y

A big part of this problem stems from C’s inclusion model, the ancient and for decades refined scribal technique of copy-pasting code. I’ll never understand why C++ didn’t evolve something akin to modules decades earlier, while spending time developing library addons that bring anecdotal value and further slowdowns. C++ takes pride in its ‘zero-cost abstraction’ model, but that simply does not apply to compile times. Any time you include a header file in a compilation unit, you are paying a non-negligible cost even if you don’t use anything from it: many standard library headers take hundreds of milliseconds to compile, and if thousands of cpps include them, this adds up enormously. C++20 modules are making their way into compilers, but large codebases are going to have a hard time migrating.
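
As an illustration of the kind of manual mitigation this forces on us, here is a sketch (with hypothetical Renderer/Texture/CommandBuffer names, not from any real codebase) of keeping a widely-included header lightweight through forward declarations and a pimpl, so only one cpp pays for the heavy includes:

```cpp
// renderer.h -- included by many files, so it must stay cheap to parse
#pragma once
#include <cstdint>

class Texture;        // forward declarations instead of #include "texture.h"
class CommandBuffer;  // ...and instead of #include "command_buffer.h"

class Renderer
{
public:
    void Submit(CommandBuffer& commands);
    Texture* FindTexture(uint64_t hash);

private:
    struct Impl;      // pimpl: private members (and the headers they need) live in the cpp
    Impl* impl = nullptr;
};

// renderer.cpp -- only this translation unit pays for the expensive headers
// #include "renderer.h"
// #include "texture.h"
// #include "command_buffer.h"
// #include <unordered_map>   // hundreds of milliseconds of parsing, paid once here
```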

There is a constant tension between convenience and compile times. I worked on a codebase where all rendering headers were put inside “render_api.h” and code from other teams included it. It was very simple to set up, but any time I touched a rendering header, it recompiled the entire codebase due to transitive inclusion. Breaking the header apart took a long time, whereas putting it there in the first place took no effort. Small actions can have large consequences, and the language has not provided a solution for decades.

A Template to Confusion

In The Sorcerer’s Apprentice, the protagonist is tired of trudging along carrying water when he has the idea to leverage his master’s magic to do it for him. His lack of experience backfires as things spiral out of control, because he cannot remember how to undo the spell. In a similar fashion, C++ template and macro magic promises to help with many problems but can cause lots of headaches later on. I’ve seen the allure: fighting the compiler hard to get it to do something specific gives a sense of accomplishment. There might also be an element of sunk cost to it. In any case, overusing it is trivial and undoing it is not.

One codebase had a powerful shader reflection facility with lots of template and macro metahackery, which worked if you left it alone but was incredibly hard to debug, modify and extend. In hindsight, an alternative like code generation could have worked better.

There are valid reasons to use templates and macros, but they are often misused and come with a lot of downsides:

  • Template or macro code beyond the basics is difficult to read and debug
  • Error messages are difficult to understand unless you’re experienced
  • Template resolution rules are hard to memorize and predict, SFINAE is very complicated
  • Templates are slow to compile (see rule of Chiel)
  • Template rules are enforced differently in different compilers, adding complexity
  • The STL’s usage of templates is pervasive; it is essentially a non-trivial language on top of C++
I plead guilty to using many templates in my own math library. To prove the point, however, when I removed them the library (paradoxically) ended up with less code, compiled faster and was functionally the same. I now avoid templates in a first approach to problems. There is a route to enlightenment where you discover the magic, abuse it, then reel it back in.
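
A sketch of the kind of simplification I mean, using hypothetical code rather than my actual library: the templated vector buys generality that never gets used, while the concrete version is shorter, compiles faster and is far easier to step through.

```cpp
// Before: templated over element type and size, even though only float vectors
// of size 2-4 were ever instantiated.
template <typename T, int N>
struct Vector
{
    T v[N];
    Vector operator+(const Vector& o) const
    {
        Vector r;
        for (int i = 0; i < N; ++i) r.v[i] = v[i] + o.v[i];
        return r;
    }
};

// After: write out the handful of types that are actually needed. Less code overall,
// no template instantiation cost in every including cpp, trivial to debug.
struct float3
{
    float x, y, z;
    float3 operator+(const float3& o) const { return { x + o.x, y + o.y, z + o.z }; }
};
```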

Unfortunately, there is an asymmetry here: liberal use of templates does a lot of damage to large codebases, and abstaining from them elsewhere doesn’t undo it; adding headers that include template code is also far easier than removing them.

Death By a Thousand Shades

As graphics programmers, shaders are our bread and butter. Here are these relatively small programs that run millions of times in parallel on the GPU to produce pretty images on screen. The shader languages we use to write them are very simple in nature; all the code is inlined, recursion is not allowed, templates are very recent, so you’d think it wouldn’t be much of a pipeline issue. However, reality is always more complicated, because there are so many shaders!

Shader philosophies differ along two axes that I’ll call responsibility and usage. Responsibility refers to who has access to authoring shaders: it can be just graphics programmers, or technical artists, or general artists and even designers; generally, the more people you have making shaders, the more you’ll have to compile. The usage axis refers to where in the frame these shaders run; for example, shaders that describe material properties typically run during a geometry phase, and lighting or post effect shaders tend to run at the end of the frame in a fullscreen pass. The geometry or material phase is typically where most variation comes from, as you’ll have shaders doing opaque surfaces, cloth, hair, skin, transparency, etc. These variations exist in many places of the frame: for example, depth-only shaders for shadows or depth prepass, transparency shaders, variations for using lightmaps, etc. If I had to guess, I’d say that 95% of shaders in a complex game fall into this category. Since the shaders that artists modify are in this material category, we encounter the so-called combinatorial shader explosion.
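
To make the combinatorics concrete, here is a toy back-of-the-envelope calculation; the feature names and counts are invented, not taken from any particular engine.

```cpp
// Every independent feature toggle doubles the number of variants of a single
// material shader; multiply by material count and platforms and the total explodes.
// Real engines prune combinations that are never used, but the growth is still combinatorial.
#include <cstdio>

int main()
{
    const char* features[] = {
        "SKINNED", "LIGHTMAPPED", "ALPHA_TESTED", "PARALLAX",
        "DETAIL_MAP", "DEPTH_ONLY", "INSTANCED", "FORWARD_PLUS",
    };
    const long long featureCount = sizeof(features) / sizeof(features[0]);

    const long long variantsPerMaterial = 1LL << featureCount; // 2^8 = 256
    const long long materials = 500;                           // authored by artists
    const long long platforms = 4;

    std::printf("%lld potential shader compiles\n", variantsPerMaterial * materials * platforms);
    return 0;
}
```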

Luciano Jacomeli - Compiling Shader 4.24

To put into perspective the extent to which these philosophies vary, consider that Doom 2016 keeps tight control over shaders, where only graphics programmers can modify them, and apparently needs only a few hundred shaders, whereas on the other side a typical Unreal Engine project probably ships with 10,000 shaders. Within these extremes, I’ve worked at places where only programmers could create and modify shaders, places where only technical artists could create the material shaders, and places where any artist could create them. The compile times for a full rebuild of shaders vary wildly across these setups.

This of course is only part of the problem, as shader compilation (at least on PC/mobile) is two-staged: the compilation that happens on the devs’ machines, and the one that happens on the users’ machines. The first step takes the textual shader, written either manually by a programmer or through a node graph tool, and produces a temporary, optimized common representation of the shader. This then needs to be translated into vendor-specific shader instructions (such as AMD’s RDNA ISA) during PSO compilation on the player’s machine, hence the #stutterstruggle that has become somewhat of a meme in PC gaming lately.
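
As a rough sketch of those two stages on PC, using the D3D toolchain as an example (a sketch of mine; the engines above may use different compilers and caching schemes):

```cpp
#include <windows.h>
#include <d3dcompiler.h>   // link against d3dcompiler.lib
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Stage 1: developer side (or build farm). HLSL text -> portable bytecode,
// which is what actually ships with the game.
ComPtr<ID3DBlob> CompileOffline(const char* hlslSource, size_t sourceLength)
{
    ComPtr<ID3DBlob> bytecode, errors;
    HRESULT hr = D3DCompile(hlslSource, sourceLength, "lit_opaque.hlsl",
                            nullptr, nullptr,        // no defines, no include handler
                            "PSMain", "ps_5_0",
                            0, 0, &bytecode, &errors);
    if (FAILED(hr) && errors)
        OutputDebugStringA(static_cast<const char*>(errors->GetBufferPointer()));
    return bytecode;
}

// Stage 2 happens on the player's machine: the driver translates this bytecode into the
// vendor ISA when the pipeline state object is created (e.g. through
// ID3D12Device::CreateGraphicsPipelineState). Doing that lazily at draw time is what
// causes the PSO stutter, which is why engines precompile and cache pipelines up front.
```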

One place I worked at had an interesting shader compilation philosophy: when artists saved the material they were authoring, it would be compiled for all platforms and the binaries would get uploaded to version control. This had the really nice property that nobody else had to compile that shader on their machine, and the disadvantage that making sweeping changes to the shaders became difficult.

A Heap of Trouble

Another big source of problems in games is heap allocation. Because games are so dynamic, they tend to spawn and destroy things left and right, be it particles, debris from destruction, short-lived sounds, network packets, etc. In rendering specifically, every frame we prepare and discard thousands of rendering commands, short-lived vertices or per-frame constant buffers. The workload is so volatile that if we mainly used heap allocations for these things, we would encounter:

  • Fragmentation: running out of usable memory because the heap is full of holes, with small allocations splitting larger blocks apart
  • Contention/Unpredictability: threads will block each other for shared resources, hitching at unpredictable times
  • Cache: a new allocation is essentially a cache miss

A large part of optimization really boils down to avoiding heap allocations. Instead, games reserve large blocks at boot time subdivided by resource type, create object pools to reuse memory, use arena allocators, or just use the stack, among other techniques. If you e.g. need scratch memory for sorting, allocating on the stack is essentially free and trivial to dispose of. Containers, structures and allocators that encourage this memory pattern are essential. If you’re interested in memory allocation strategies I’d recommend giving this article a read too.
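
As an example of what these techniques look like in practice, here is a minimal sketch of a per-frame linear (arena) allocator; it is my own simplified, single-threaded version, not taken from any particular engine. Everything allocated during the frame is released with a single reset, so there is no fragmentation and no per-allocation bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

class FrameArena
{
public:
    explicit FrameArena(size_t capacity)
        : m_base(static_cast<uint8_t*>(std::malloc(capacity))), m_capacity(capacity) {}
    ~FrameArena() { std::free(m_base); }

    // Bump the offset; no headers, no free list, no lock in this simplified version.
    void* Allocate(size_t size, size_t alignment = 16)
    {
        size_t aligned = (m_offset + alignment - 1) & ~(alignment - 1);
        assert(aligned + size <= m_capacity && "frame arena exhausted");
        m_offset = aligned + size;
        return m_base + aligned;
    }

    // Called once per frame, when the data is no longer needed.
    void Reset() { m_offset = 0; }

private:
    uint8_t* m_base     = nullptr;
    size_t   m_capacity = 0;
    size_t   m_offset   = 0;
};

// Usage sketch: carve per-frame scratch such as sort keys or constant data.
// FrameArena arena(8 * 1024 * 1024);
// auto* keys = static_cast<uint32_t*>(arena.Allocate(drawCount * sizeof(uint32_t)));
// ...build and submit the frame...
// arena.Reset();
```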

Years ago I worked on Android games written in Java. Because the language doesn’t provide value types, even vector math and string processing would keep the heap/garbage collector active and hitching all the time. We resorted to global StringBuffers and Vectors for intermediate calculations, a very cumbersome and error-prone use of the language.

Continue reading

Temporal AA and the quest for the Holy Trail

Long gone are the times when Temporal AA was a novel technique, and more articles slowly appear covering motivations, implementations and solutions. I will throw my programming hat into the ring to walk through it, almost like a tutorial, for the future me and for anyone interested. I am using Matt Pettineo’s MSAAFilter demo to show the different stages. The contents come mostly from the invaluable work of many talented developers, and a little from my own experience. I will introduce a couple of tricks I have come across that I haven’t seen in papers or presentations.

Sources of aliasing

The sources of aliasing in CG images vary wildly. Geometric (edge) aliasing, alpha testing, specular highlights, high frequency normals, parallax mapping, low resolution effects (SSAO, SSR), dithering and noise all conspire to destroy our visuals. Some solutions, like hardware MSAA and screen space edge detection techniques, work for a subset of cases but fail in different ways. Temporal techniques attempt to achieve supersampling by distributing the computations across multiple frames, while addressing all forms of aliasing. This stabilizes the image but also creates some challenging artifacts.

Jitter

The main principle of TAA is to compute multiple sub-pixel samples across frames, then combine them into a single final pixel. The simplest scheme generates random samples within the pixel, but there are better ways of producing fixed sequences of samples. A short overview of quasi-random sequences can be found here. It is important to select a good sequence to avoid clumping, and a small, fixed number of samples within the sequence: typically 4-8 work well. In practice this is more important for a static image than a dynamic one. Below, a pixel with 4 samples.

To produce random sub-samples within a pixel we translate the projection matrix by a fraction of a pixel along the frustum plane. The valid range for the jitter offset (relative to the pixel center) is half the inverse of the screen dimension in pixels, so \left[\dfrac{-1}{2w},\dfrac{1}{2w}\right] and \left[\dfrac{-1}{2h},\dfrac{1}{2h}\right]. We multiply the offset matrix (just a normal translation matrix) by the projection matrix to get the modified projection, as shown below.

 

\[
\begin{pmatrix}
\dfrac{2n}{w} & 0 & 0 & 0\\
0 & \dfrac{2n}{h} & 0 & 0\\
0 & 0 & \dfrac{f}{f-n} & 1\\
0 & 0 & \dfrac{-f \cdot n}{f-n} & 0
\end{pmatrix}
\cdot
\begin{pmatrix}
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0\\
j_x & j_y & 0 & 1
\end{pmatrix}
=
\begin{pmatrix}
\dfrac{2n}{w} & 0 & 0 & 0\\
0 & \dfrac{2n}{h} & 0 & 0\\
j_x & j_y & \dfrac{f}{f-n} & 1\\
0 & 0 & \dfrac{-f \cdot n}{f-n} & 0
\end{pmatrix}
\]

 

Once we have a set of samples, we use this matrix to rasterize geometry as normal to produce the image that corresponds to the sample. If it all works well and every frame you get a new jitter, the image should look wobbly like this.
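
A sketch of that setup in code, following the row-vector convention of the matrices above; it is my own illustration using a Halton(2,3) sequence, one of several sequences that work well here.

```cpp
#include <cmath>

// Halton sequence: a commonly used low-discrepancy sequence for TAA jitter.
float Halton(int index, int base)
{
    float result = 0.0f, f = 1.0f;
    while (index > 0)
    {
        f /= base;
        result += f * (index % base);
        index /= base;
    }
    return result;
}

struct Matrix4 { float m[4][4]; };

void ApplyJitter(Matrix4& proj, int frameIndex, int width, int height)
{
    const int sampleCount = 8;                    // cycle through 8 samples of the sequence
    const int i = (frameIndex % sampleCount) + 1; // Halton is usually started at index 1

    // Remap the [0,1) samples to the jitter range derived above; some implementations
    // scale this differently to span the full pixel.
    const float jx = (Halton(i, 2) - 0.5f) / width;
    const float jy = (Halton(i, 3) - 0.5f) / height;

    // Equivalent to multiplying by the translation matrix in the equation above.
    proj.m[2][0] += jx;
    proj.m[2][1] += jy;
}
```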

Continue reading

The Rendering of Mafia: Definitive Edition

Mafia: Definitive Edition (2020) is a remake of the much-loved gangster classic Mafia (2002), originally released for PS2 and Xbox. The game is relatively linear and very story focused; I personally found its narrative gripping and worthy of being compared to Scarface or Goodfellas. Hangar 13 uses their own technology for open worlds and stories, previously used for Mafia III, to bring Tommy and the Salieri family to life. It is a DX11 deferred engine on PC, and RenderDoc 1.13 was used to capture and analyze the frame.

The Frame

Tommy looks like he means business with his jacket and fedora, and thus our frame analysis begins. I chose a nighttime city scene as I find it more moody and challenging to get right. Let’s dive right in: I’ll make you a rendering offer you can’t refuse.

Depth Prepass

As we know, a depth prepass is often a careful balance between the time you spend doing it and the time you save through more effective occlusion. Objects seem to be relatively well selected and sorted by depth and size, as by drawcall 120 we already have a lot of the biggest content in the depth buffer, rendered with very simple shaders. Subsequent drawcalls often fail the depth test after that, avoiding wasted work. There are some odd choices, like the electricity wires, which I assume have large bounding boxes, but most of it makes sense and probably costs little compared to what it saves.

GBuffer Pass

The GBuffer for Mafia packs quite a lot of information. The first texture contains normals and roughness, which is quite standard these days, in 16-bit floating point. While it’s a little large for my taste, normals tend to want as much bit-depth as possible, especially if no compression schemes are used.

R16F       G16F       B16F       A16F
Normal.x   Normal.y   Normal.z   Roughness

GBuffer Normals
GBuffer Roughness
 

 

The second texture contains albedo and metalness in an 8-bit normalized format, which is also common for PBR engines and relevant here since cars sport very reflective chrome components. As you can see, metallic parts are marked as white whereas almost everything else is black (i.e. non-metal).

R8         G8         B8         A8
Albedo.r   Albedo.g   Albedo.b   Metalness

GBuffer Albedo
GBuffer Metalness
 

 

The next texture contains packed quantities that are not easy to decode by inspection. RenderDoc has a neat feature, custom shaders, that will come to our aid. Searching the capture we come across the code for decoding these channels, and after adapting the D3D bytecode back to HLSL, displaying them on screen actually starts to make sense. The first 3 channels are motion vectors (including a z component, which I find interesting), and the last channel is the vertex normal encoded in two 8-bit values (z is implicit). It’s interesting to note that vertex normals have only been given 2 bytes, as opposed to the 6 bytes assigned to per-pixel normals. Vertex normals are an unusual thing to output, but we’ll soon find out why.

R16U             G16U             B16U             A16U
MotionVector.x   MotionVector.y   MotionVector.z   Encoded Vertex Normal

GBuffer Encoded Motion Vectors
GBuffer Decoded Motion Vectors
GBuffer Encoded Vertex Normal
GBuffer Decoded Vertex Normal
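
As an aside, here is one plausible way such a two-channel vertex normal could be unpacked, with z reconstructed from unit length; this is purely illustrative, and the exact encoding Mafia uses may well be different (octahedral, view space with a stored sign, etc.).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Float3 { float x, y, z; };

Float3 DecodeVertexNormal(uint16_t packedChannel)
{
    // Split the 16-bit channel into two 8-bit values and remap [0,255] -> [-1,1].
    const float nx = ((packedChannel        & 0xFF) / 255.0f) * 2.0f - 1.0f;
    const float ny = (((packedChannel >> 8) & 0xFF) / 255.0f) * 2.0f - 1.0f;

    // Reconstruct z assuming a unit-length normal; the sign of z is lost with this
    // scheme unless it is stored elsewhere or the normal is known to face the camera.
    const float nz = std::sqrt(std::max(0.0f, 1.0f - nx * nx - ny * ny));
    return { nx, ny, nz };
}
```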
 

 

The fourth texture contains miscellaneous quantities such as specular intensity, curvature and profile for subsurface scattering, and flags. The G component is set to 0.5, so it may be an unused/spare channel reserved for future usage.

R8                   G8    B8                                 A8
Specular Intensity   0.5   Curvature or Thickness (for SSS)   SSS Profile

GBuffer Specular Intensity
GBuffer Curvature
GBuffer SSS Profile
 

 

The last entry in the GBuffer is the emissive lighting, which becomes the main lighting buffer from now on.

R11F         G11F         B10F
Emissive.r   Emissive.g   Emissive.b

One interesting performance decision for the GBuffer is not clearing it at the start of the frame. Sometimes clearing a buffer is necessary, but you can avoid the cost if you’re going to overwrite the contents and you know where (by marking it in the stencil). There are other performance penalties involved in clearing depending on platform, so the gist of it is it’s never a bad idea to avoid clearing if you can.

Continue reading

A Macro View of Nanite

After showing an impressive demo last year and being unleashed recently with the UE5 preview, Nanite is all the rage these days. I just had to go in, have some fun trying to figure it out, and explain how I think it operates and the technical decisions behind it using a RenderDoc capture. Props to Epic for being open with their tech, which makes it easier to learn and pick apart; the editor has markers and debug information that are going to be super helpful.

This is the frame we’re going to be looking at, from the epic showdown in the Valley of the Ancient demo project. It shows the interaction between Nanite and non-Nanite geometry and it’s just plain badass.

Nanite::CullRasterize

The first stage in this process is Nanite::CullRasterize, and it looks like this. In a nutshell, this entire pass is responsible for culling instances and triangles and rasterizing them. We’ll refer to it as we go through the capture.

Instance Culling

Instance culling is one of the first things that happens here. It looks to be a GPU form of frustum and occlusion culling. There are instance data and primitive data bound here; I guess that means it culls at the instance level first, and if the instance survives it starts culling at a finer-grained level. The Nanite.Views buffer provides camera info for frustum culling, and a hierarchical depth buffer (HZB) is used for occlusion culling. The HZB is sourced from the previous frame and forward-projected to this one. I’m not sure how it deals with dynamic objects; it may be that it uses such a large mip (small resolution) that it is conservative enough. EDIT: According to the Nanite paper, the HZB is generated this frame from the previous frame’s visible objects. The HZB is tested against the previous objects as well as anything new, and visibility is updated for the next frame.

Both visible and non-visible instances are written into buffers. For the latter I’m thinking this is the way of doing what occlusion queries used to do in the standard mesh pipeline: inform the CPU that a certain entity is occluded and it should stop processing until it becomes visible. The visible instances are also written out into a list of candidates.

Persistent Culling

Persistent culling seems to be related to streaming. It runs a fixed number of compute threads, suggesting it is unrelated to the complexity of the scene and instead maybe checks some spatial structure for occlusion. This is one complicated shader, but based on the inputs and outputs we can see it writes out how many triangle clusters of each type (compute and traditional raster) are visible into a buffer called MainRasterizeArgsSWHW (SW: compute, HW: raster).

Clustering and LODding

It’s worth mentioning LODs at this point as it is probably around here where those decisions are made. Some people speculated geometry images as a way to do continuous LODding but I see no indication of this. Triangles are grouped into patches called clusters, and some amount of culling is done at the cluster level. The clustering technique has been described before in papers by Ubisoft and Frostbite. For LODs, clusters start appearing and disappearing as the level of detail descends within instances. Some very clever magical incantations are employed here that ensure all the combinations of clusters stitch into each other seamlessly.

Continue reading

The Rendering of Jurassic World: Evolution

Jurassic World: Evolution is the kind of game many kids (and adult-kids) dreamed of for a long time. What’s not to like about a game that gives you the reins of a park where the main attractions are 65-million-year-old colossal beasts? This isn’t the first successful amusement park game by Frontier Developments, but it’s certainly not your typical one. Frontier is a proud developer of their Cobra technology, which has been evolving since 1988. For JWE in particular it is a DX11 tiled forward renderer. For the analysis I used Renderdoc and turned on all the graphics bells and whistles. Welcome… to Jurassic Park.

The Frame

It’s hard to decide what to present as a frame for this game, because free navigation and a dynamic time of day mean you have limitless possibilities. I chose a moody, rainy intermediate view that captures the dark essence of the original movies, taking advantage of the Capture Mode.

Compute Shaders

The first thing to notice about the frame is that it is very compute-heavy. In the absence of markers, Renderdoc splits rendering into passes if there is more than one Draw or Dispatch command targeting the same output buffers. According to the capture there are 15 compute vs 18 color/depth passes, i.e. it is broadly split into half compute, half draw techniques. Compute can be more flexible than draw (and, if done correctly, faster), but a lot of time has to be spent fine-tuning and balancing performance. Frontier clearly spared no expense developing the technology to get there; however, this also means that analyzing a frame is a bit harder.

Grass Displacement

A big component of JWE is foliage and its interaction with cars, dinosaurs, wind, etc. To animate the grass, one of the very first processes populates a top-down texture that contains grass displacement information. This grass displacement texture is later read in the vertex shader of all the grass in the game, and the information is used to modify the position of the vertices of each blade of grass. The texture wraps around as the camera moves and fills in the new regions that appear at the edges. This means that the texture doesn’t necessarily look like a top-down snapshot of the scene, but will typically be split into 4 quadrants (there is a sketch of this wrap-around addressing right after the list below). The process involves these steps:

  1. Render dinosaurs and cars, and probably other objects such as the gyrospheres. This doesn’t need an accurate version of the geometry; e.g. cars only render the wheels and the part of the chassis that is in contact with grass. The result is a top-down depth buffer (leftmost image). If you squint you’ll see the profile of an ankylosaurus. The other dinosaurs aren’t rendered here; perhaps the engine knows they aren’t stepping on grass and optimizes them out.
  2. Take this depth buffer and a heightmap of the scene (center image), and output three quantities: a mask telling whether the depth of the object was above or below the terrain, the difference in depth between them, and the actual depth, packed into a 3-channel texture (rightmost image)
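
A sketch of the wrap-around addressing mentioned above; the function, names and numbers are mine, not Frontier’s. The texture covers a fixed-size window of the world, and world positions are simply wrapped into texel space as the camera moves, so only newly revealed rows and columns need redrawing, which is also why the content ends up split into quadrants rather than looking like a clean top-down snapshot.

```cpp
#include <cmath>

struct Texel { int x, y; };

// Map a world-space position to a texel of the displacement map, wrapping toroidally.
Texel WorldToDisplacementTexel(float worldX, float worldZ,
                               float metersCovered, int textureSize)
{
    // World position expressed in texel units of the displacement map.
    const float u = worldX / metersCovered * textureSize;
    const float v = worldZ / metersCovered * textureSize;

    // Wrap instead of recentering, so existing texels keep their place as the camera moves.
    auto wrap = [textureSize](float t) {
        int i = static_cast<int>(std::floor(t)) % textureSize;
        return i < 0 ? i + textureSize : i;
    };
    return { wrap(u), wrap(v) };
}
```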



An additional process simulates wind. In this particular scene there is a general breeze from the storm plus a helicopter, both producing currents that displace grass. This is a top-down texture similar to the one before, containing motion vectors in 2D. The motion for the wind is an undulating texture meant to mimic wind waves, which seems to have been computed on the CPU, and the influence of the helicopter is cleverly done by blending a stream of particles on top of the first texture. You can see it in the image as streams pulling outward. Dinosaur and car motion is also blended here. I’m not entirely sure what the purpose of the repeating texture is (you can see the same objects repeated multiple times).

Continue reading

Rendering Line Lights

Within the arsenal of lights provided by game engines, the most popular are punctual lights such as point, spot or directional, because they are cheap. On the other end, area lights have recently given rise to incredible techniques such as Linearly Transformed Cosines and other analytic approximations. I want to talk about the line light.

Update [04/09/2020]: When I originally wrote the article there were no public images showing Jedi or lightsabers, so I couldn’t make the connection (though a clever reader could have concluded what they might be for!). I can finally show this work off as it’s meant to be. You can also watch a gameplay trailer here.

In Unreal Engine 4, modifying ‘Source Length’ on a point light elongates it as described in this paper. It spreads the intensity along the length so a longer light becomes perceptually dimmer. Frostbite also has tube lights, a complex implementation of the analytical illuminance emitted by a cylinder and two spheres. Unity includes tube lights as well in their HD Render Pipeline (thanks Eric Heitz and Evegenii Golubev for pointing it out) based on their LTC theory, which you can find a great explanation and demos for here. Guerrilla Games’ Decima Engine has elongated quad lights using an approach for which they have a very attractive and thorough explanation in GPU Pro 5’s chapter II.1, Physically Based Area Lights. This is what I adapted to line lights.

Continue reading

The Rendering of Rise of the Tomb Raider

Rise of the Tomb Raider (2015) is the sequel to the excellent Tomb Raider (2013) reboot. I personally find both refreshing as they move away from the stagnating original series and retell the Croft story. The game is story focused and, like its prequel, offers enjoyable crafting, hunting and climbing/exploring mechanics.

Tomb Raider used the Crystal Engine, developed by Crystal Dynamics and also used in Deus Ex: Human Revolution. For the sequel a new engine called Foundation was used, previously developed for Lara Croft and the Temple of Osiris (2014). Its rendering can be broadly classified as a tiled light-prepass engine, and we’ll see what that means as we dive in. The engine offers the choice between a DX11 and a DX12 renderer; I chose the latter for reasons we’ll see later. I used Renderdoc 1.2 to capture the frame on a Geforce 980 Ti, and turned on all the bells and whistles.

The Frame

I can safely say without spoilers that in this frame bad guys chase Lara because she’s looking for an artifact they’re looking for too, a conflict of interest that absolutely must be resolved using weapons. Lara is inside the enemy base at nighttime. I chose a frame with atmospheric and contrasty lighting where the engine can show off.

Depth Prepass

A customary optimization in many games, a small depth prepass takes place here (~100 draw calls). The game renders the biggest objects (rather, the ones that take up the most screen space) to take advantage of the Early-Z capability of GPUs. A concise article by Intel explains further. In short, the GPU can avoid running a pixel shader if it can determine it’s occluded behind a previous pixel. It’s a relatively cheap pass that pre-populates the Z-buffer with depth.

An interesting thing I found is a level of detail (LOD) technique called ‘fizzle’ or ‘checkerboard’. It’s a common way to fade objects in and out at a distance, either to later replace them with a lower quality mesh or to make them disappear completely. Take a look at this truck. It seems to be rendering twice, but in reality it’s rendering a high LOD and a low LOD at the same position, each rendering to the pixels the other is not rendering to. The first LOD is 182226 vertices, whereas the second LOD is 47250. They’re visually indistinguishable at a distance, and yet one is almost 4 times cheaper. In this frame, LOD 0 has almost disappeared while LOD 1 is almost fully rendered. Once LOD 0 completely disappears, only LOD 1 will render.

A pseudorandom texture and a probability factor allow us to discard pixels that don’t pass a threshold. You can see this texture used in ROTR. You might be asking yourself why not use alpha blending instead. There are many disadvantages to alpha blending compared to fizzle fading.

  1. Depth prepass-friendly: By rendering it like an opaque object and puncturing holes, we can still render into the prepass and take advantage of early-z. Alpha blended objects don’t render into the depth buffer this early due to sorting issues.
  2. Needs extra shader(s): If you have a deferred renderer, your opaque shader doesn’t do any lighting. You need a separate variant that does if you’re going to swap an opaque object for a transparent one. Aside from the memory/complexity cost of having at least an extra shader for all opaque objects, they need to be accurate to avoid popping. There are many reasons why this is hard, but it boils down to the fact they’re now rendering through a different code path.
  3. More overdraw: Alpha blending can produce more overdraw and depending on the complexity of your objects you might find yourself paying a large bandwidth cost for LOD fading.
  4. Z-fighting: z-fighting is the flickering effect when two polygons render to a very similar depth such that floating point imprecision causes them to “take turns” to render. If we render two consecutive LODs by fading one out and the next one in, they might z-fight since they’re so close together. There are ways around it like biasing one over the other but it gets tricky.
  5. Z-buffer effects: Many effects like SSAO rely on the depth buffer. If we render transparent objects at the end of the pipeline when ambient occlusion has run already, we won’t be able to factor them in.

One disadvantage of this technique is that it can look worse than alpha fading, but a good noise pattern, post-fizzle blurring or temporal AA can hide it to a large extent. ROTR doesn’t do anything fancy in this respect.
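
For reference, the thresholding logic itself is tiny. Here is a CPU-side sketch of it; the real version lives in the pixel shader and uses clip()/discard, and the parameter names are mine, not Crystal Dynamics’.

```cpp
#include <cstdint>

// Returns true if this pixel of the LOD should be kept, false if it should be punched out.
// fadeIn goes 0 -> 1 as a LOD fades in; the outgoing LOD uses (1 - fadeIn) so the two
// LODs cover complementary sets of pixels and never blend.
bool KeepPixel(uint32_t px, uint32_t py, float fadeIn,
               const float* noiseTexture, uint32_t noiseSize)
{
    const float noise = noiseTexture[(py % noiseSize) * noiseSize + (px % noiseSize)]; // [0,1)
    return noise < fadeIn;   // in HLSL this would be clip(fadeIn - noise)
}
```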

Normals Pass

Crystal Dynamics uses a relatively unusual lighting scheme for its games that we’ll describe in the lighting pass. For now, suffice it to say that there is no G-Buffer pass, at least not in the sense that other games have accustomed us to. Instead, the objects in this pass only output depth and normals information. Normals are written to an RGBA16_SNORM render target in world space. As a curiosity, this engine uses Z-up as opposed to Y-up, which is what I see more often in other engines/modelling packages. The alpha channel contains glossiness, which will be decompressed later as exp2(glossiness * 12 + 1.0). The glossiness value can actually be negative, as the sign is used as a flag to indicate whether a surface is metallic or not. You can almost spot it yourself, as the darker colors in the alpha channel are all metallic objects.

R          G          B          A
Normal.x   Normal.y   Normal.z   Glossiness + Metalness

Normals
Glossiness/Metalness
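
A small sketch of how such a packing could be read back; the sign/magnitude split below is my reading of the capture, and the actual shader may differ.

```cpp
#include <cmath>

struct SurfaceParams { float specularPower; bool metallic; };

SurfaceParams DecodeGlossMetal(float encoded) // alpha channel of the normals target
{
    SurfaceParams p;
    p.metallic      = (encoded < 0.0f);              // the sign doubles as a metalness flag
    const float g   = std::fabs(encoded);            // assumption: magnitude carries glossiness
    p.specularPower = std::exp2(g * 12.0f + 1.0f);   // decompression quoted above
    return p;
}
```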
 

Continue reading

A Real Life Pinhole Camera

When I married last year, my wife and I went on our honeymoon to Thailand. Their king Bhumibol had died a month earlier and the country was mourning. Everywhere we found good wishes and memorials, and people would dress in black and white as a sign of sorrow. The Thai are a gentle and polite people who like to help out; we’d ask for directions and people with little notion of English would spend twenty minutes understanding and answering our questions. Thailand has a rich history of rising and falling kingdoms, great kings and battles, and unification and invasions by foreign kingdoms. There are some amazing ruins of these kingdoms. Thailand also lives by a variant of Buddhism reflected in their beautiful temples. Some of the architectural features I found most interesting are the small reflective tiles that cover the outer walls, animal motifs like the Garuda (bird creatures that can be seen on the rooftops), and snake-like creatures called Naga. It is in this unexpected context that I found a real-life pinhole camera. I always wear my graphics hat, so I decided to capture it and later make a post.

First, a little background. A pinhole camera (also known as camera obscura, after its Latin name) is essentially the simplest camera you can come up with. If you conceptually imagine a closed box with a single, minuscule hole in one of its faces, such that only a single ray from each direction can come inside, you’d get a mirrored image on the inner face opposite the pinhole. An image is worth more than a thousand explanations, so here’s what I’m talking about.

 

Pinhole Diagram
Pinhole Diagram
Pinhole Diagram
Pinhole Diagram
 

 

As you can see, the concept is simple. If you were inside the room, you’d see an inverted image of the outside. The hole is so small that the room would be fairly dark, so even the faint light bouncing back towards you would still be visible. I made the pinhole a hexagon, as I wanted to suggest the fact that it is effectively the shutter of a modern camera. Louis Daguerre, one of the fathers of photography, used this model in his famous daguerreotype circa 1835, but Leonardo da Vinci had already described the phenomenon as an oculus artificialis (artificial eye) in one of his works as early as 1502. There are plenty of additional resources if you’re interested, and even a pretty cool tutorial on how to create your own.

Now that we understand what this camera is, let’s look at the real image I encountered. I’ve aligned the inside and outside images I took and cast rays so you can see what I mean.

 

Real Pinhole Camera
Real Pinhole Camera
Real Pinhole Camera
Real Pinhole Camera
Real Pinhole Camera
 

 

The image of the inside looks bright, but I had to take it with 1 second of exposure and it still looks relatively dark. On top of that, the day outside was very sunny, which helped a lot in getting a clear “photograph”.

The Rendering of Middle Earth: Shadow of Mordor

Middle Earth: Shadow of Mordor was released in 2014. The game itself was a great surprise, and the fact that it was a spin-off within the storyline of the Lord of the Rings universe was quite unusual and something I enjoyed. The game was a great success, and at the time of writing, Monolith has already released the sequel, Shadow of War. The game’s graphics are beautiful, especially considering it was a cross-generation game also released on Xbox 360 and PS3. The PC version is quite polished and features a few extra graphical options and high-resolution texture packs that make it shine.

The game uses a relatively modern deferred DX11 renderer. I used Renderdoc to delve into the game’s rendering techniques. I used the highest possible graphical settings (ultra) and enabled all the bells and whistles like order-independent transparency, tessellation, screen-space occlusion and the different motion blurs.

The Frame

This is the frame we’ll be analyzing. We’re at the top of a wooden scaffolding in the Udun region. Shadow of Mordor has similar mechanics to games like Assassin’s Creed where you can climb buildings and towers and enjoy some beautiful digital scenery from them.

Depth Prepass

The first ~140 draw calls perform a quick prepass to render the biggest elements of the terrain and buildings into the depth buffer. Most things don’t end up appearing in this prepass, but it helps when you’ve got a very big number of draw calls and a far range of view. Interestingly, the character, who is always in front and takes up a decent amount of screen space, does not go into the prepass. As is common for many open world games, the game employs reverse-z, a technique that maps the near plane to 1.0 and the far plane to 0.0 for increased precision at great distances and to prevent z-fighting. You can read more about z-buffer precision here.
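
For illustration, this is what a reverse-Z setup generally looks like; it is a generic sketch in the row-vector matrix convention used elsewhere on this blog, not Monolith’s actual code. Besides the projection, the depth test flips to GREATER and the depth buffer is cleared to 0.0 instead of 1.0.

```cpp
#include <d3d11.h>

struct Matrix4 { float m[4][4]; };

// n = near plane, f = far plane, w/h = frustum width/height at the near plane.
Matrix4 MakeReverseZPerspective(float n, float f, float w, float h)
{
    Matrix4 p = {};
    p.m[0][0] = 2.0f * n / w;
    p.m[1][1] = 2.0f * n / h;
    p.m[2][2] = n / (n - f);       // z = near maps to depth 1.0...
    p.m[2][3] = 1.0f;
    p.m[3][2] = -f * n / (n - f);  // ...and z = far maps to depth 0.0
    return p;
}

void SetupReverseZDepthState(D3D11_DEPTH_STENCIL_DESC& desc)
{
    desc.DepthEnable    = TRUE;
    desc.DepthWriteMask = D3D11_DEPTH_WRITE_MASK_ALL;
    desc.DepthFunc      = D3D11_COMPARISON_GREATER; // "closer than" becomes "greater than"
}
```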

 

G-buffer

Right after that, the G-Buffer pass begins, with around 2700 draw calls. If you’ve read my previous analysis of Castlevania: Lords of Shadow 2 or other similar articles, you’ll be familiar with this pass. Surface properties are written to a set of buffers that are read later on by lighting passes to compute their response to light. Shadow of Mordor uses a classical deferred renderer, but with a comparatively small number of G-Buffer render targets (3) to achieve its objective. Just for comparison, Unreal Engine uses between 5 and 6 buffers in this pass. The G-Buffer layout is as follows:

Normals Buffer
R          G          B          A
Normal.x   Normal.y   Normal.z   ID

The normals buffer stores the normals in world space, in an 8-bit per channel format. This is a little bit tight, sometimes not enough to accurately represent smoothly varying flat surfaces, as can be seen in some puddles throughout the game if you pay close attention. The alpha channel is used as an ID that marks different types of objects. Some that I’ve found correspond to characters (255) and animated plants or flags (128), while the sky is marked with ID 1, which is later used to filter it out during the bloom phase (it gets its own radial bloom).

World Space Normals
Object ID
 

Continue reading

Photoshop Blend Modes Without Backbuffer Copy

For the past couple of weeks, I have been trying to replicate the Photoshop blend modes in Unity. It is no easy task; despite the advances of modern graphics hardware, the blend unit still resists being programmable and will probably remain fixed for some time. Some OpenGL ES extensions implement this functionality, but most hardware and APIs don’t. So what options do we have?

1) Backbuffer copy

A common approach is to copy the entire backbuffer before doing the blending. This is what Unity does. After that it’s trivial to implement any blending you want in shader code. The obvious problem with this approach is that you need to do a full backbuffer copy before you do the blending operation. There are certainly some possible optimizations like only copying what you need to a smaller texture of some sort, but it gets complicated once you have many objects using blend modes. You can also do just a single backbuffer copy and re-use it, but then you can’t stack different blended objects on top of each other. In Unity, this is done via a GrabPass. It is the approach used by the Blend Modes plugin.

2) Leveraging the Blend Unit

Modern GPUs have a little unit at the end of the graphics pipeline called the Output Merger. It’s the hardware responsible for taking the output of a pixel shader and blending it with the backbuffer. It’s not programmable, as making it so has quite a lot of complications (you can read about it here), so current GPUs don’t have a programmable blend unit.

The blend mode formulas were obtained here and here. Use them as a reference to compare with what I provide. There are many other sources. One thing I’ve noticed is that the provided formulas often neglect to mention that Photoshop actually uses modified formulas and clamps quantities in a different manner, especially when dealing with alpha. Gimp does the same. This is my experience recreating the Photoshop blend modes exclusively using a combination of the blend unit and shaders. The first few blend modes are simple, but as we progress we’ll have to resort to more and more tricks to get what we want.

Two caveats before we start. First off, Photoshop blend modes do their blending in sRGB space, which means that if you do them in linear space they will look wrong. Generally this isn’t a problem, but due to the amount of trickery we’ll be doing for these blend modes, many of the values need to go beyond the 0 – 1 range, which means we need an HDR buffer to do the calculations. Unity can do this by setting the camera to be HDR in the camera settings, and also setting Gamma as the color space in the Player Settings. This is clearly undesirable if you do your lighting calculations in linear space. In a custom engine you would probably be able to set this up differently (to allow for linear lighting).

If you want to try the code out while you read ahead, download it here.

A) Darken

Formula min(SrcColor, DstColor)
Shader Output
Blend Unit Min(SrcColor · One, DstColor · One)

darken

As alpha approaches 0, we need the minimum to tend towards DstColor, which we do by forcing SrcColor towards the maximum possible color, float3(1, 1, 1).
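
For reference, the same fixed-function configuration expressed as a raw D3D11 blend state; this is my own sketch, since the post itself sets this up through Unity. The pixel shader is expected to push its output towards float3(1, 1, 1) as alpha approaches 0 (e.g. via a lerp), as described above.

```cpp
#include <d3d11.h>

D3D11_BLEND_DESC MakeDarkenBlendDesc()
{
    D3D11_BLEND_DESC desc = {};
    desc.RenderTarget[0].BlendEnable           = TRUE;
    desc.RenderTarget[0].SrcBlend              = D3D11_BLEND_ONE;    // SrcColor * One
    desc.RenderTarget[0].DestBlend             = D3D11_BLEND_ONE;    // DstColor * One
    desc.RenderTarget[0].BlendOp               = D3D11_BLEND_OP_MIN; // Min(...); D3D ignores
                                                                     // the factors for MIN, so
                                                                     // ONE matches the formula
    desc.RenderTarget[0].SrcBlendAlpha         = D3D11_BLEND_ONE;
    desc.RenderTarget[0].DestBlendAlpha        = D3D11_BLEND_ONE;
    desc.RenderTarget[0].BlendOpAlpha          = D3D11_BLEND_OP_MIN;
    desc.RenderTarget[0].RenderTargetWriteMask = D3D11_COLOR_WRITE_ENABLE_ALL;
    return desc;
}
```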

B) Multiply

Formula SrcColor · DstColor
Shader Output
Blend Unit SrcColor · DstColor + DstColor · OneMinusSrcAlpha

multiply

Continue reading