Life and Death of a Graphics Programmer

Recurrent internet discussions show a divide between programmers working in different industries. Topics like code clarity, performance, debuggability, architecture or maintainability are a source of friction. We are, paraphrasing the quote, industries divided by a common language. I am curious about other programmers’ experiences, and I wanted to present a general view of mine as a graphics programmer in games, in the form of anecdotes and examples. It’s not meant to be a rant or an exhaustive list, rather a description of common problems and pitfalls, with personal experience sprinkled in. The target audience is either videogame developers who want to nod throughout or developers writing very different software who are curious about what we do. It focuses on C++ and shader languages because that’s mostly what we use.

Hard Requirements

Videogames cram very demanding processing into modest mainstream hardware (consoles, mobile) while attempting to run fast and consistently: a combination of I/O, networking, audio, physics, pathfinding, low latency input, gameplay, and displaying images on screen in a handful of milliseconds. Systems like embedded hardware applications (cars, space, low latency trading) are similarly constrained but operate in a very specialized domain. On another part of the software spectrum we find UI-centric programs such as word processors, browsers or management software, which are more event-driven and tolerant to a bit more latency.

There are also requirements games don’t have. Most don’t have stringent security concerns like OSs, transportation or banking (except online games or competitive e-sports). Game-breaking bugs aren’t life-threatening. High-frequency trading or automotive image processing applications have very strict correctness requirements, whereas players are mostly tolerant to some glitches as long as they’re having fun. Games don’t distribute their source code or interface with the world’s code so certain API restrictions don’t exist, e.g. we don’t build DLLs or provide SDKs. Some code is specific to a release so there’s a subset that can be hacked together right before shipping.

With that in mind, videogames care about performance in many more areas than other kinds of software, not just at runtime but also in the tools. Performance becomes part of system correctness. Just as examples, all of these situations from different domains are wrong:

  • Audio lags behind the image, or image lags behind the audio in a cutscene
  • Networking is too slow in an online game and the game pauses frequently
  • Streaming is too slow and the game stutters as you traverse
  • Input lags behind the response and causes a loss of control
I once saw a cutscene system where the audio was not synced to the video/animation but instead the video tracked the audio, to avoid the typical audio drift and get more consistent synchronization between them. Humor and fast action are the essence of those cutscenes, and that’s a creative way to make sure the comedy lands correctly.

Waiting for Mr Compiler

I spend an inordinate amount of time waiting for the computer to do things I need to work. Sometimes it’s loading, sometimes processing assets, but most of the time it’s compiling, both C++ code and shaders. Every company I’ve worked for has used C++ for the engine and HLSL for shaders. Compile times are not unique to games, but they are a reality in every large codebase I’ve worked on; a frustrating, soulless ritual necessary to get your code from doing A to doing B. It distracts from doing meaningful work and breaks concentration. It is the very opposite of fast iteration. Let’s just state some bullet points from my experience:

  • A full rebuild of “the engine” can take anywhere from 10 to 40 minutes. I know of smaller codebases where it’s faster, and there’s definitely worse (e.g. Unreal Engine)
  • A full rebuild of “the shaders” can also take a really long time, depending on how your shader setup works
  • An incremental build for a single file change can take anywhere from seconds to a full rebuild’s worth of time, depending on whether you touched a header included everywhere or a cpp with no dependencies
  • Many shops use Incredibuild to speed up compilation. Even that is often not enough
  • Code lives in SSD/NVMe drives now, which means I/O is rarely the issue (compiling through the network does reintroduce the problem)
  • Parallel compilation is standard these days, all cores are engaged in this process
  • Linking is normally single threaded and can take a very long time
  • Throwing more hardware at the problem mitigates it briefly until your codebase inflates again
  • Some codebases use PCHs and others Unity builds. Both are improvements but also manual and difficult to maintain
  • We compile for many platforms. A rather extreme example, some LEGO games shipped for 7 platforms simultaneously
  • Every platform’s tooling is different. You might find that compiling for platform X is much slower than for platform Y

A big part of this problem stems from C’s inclusion model, the ancient and for decades refined scribal technique of copy-pasting code. I’ll never understand why C++ didn’t evolve something akin to modules decades earlier, while spending time developing library addons that bring anecdotal value and further slowdowns. C++ takes pride in its ‘zero-cost abstraction’ model, but that simply does not apply to compile times. Any time you include a header file in a compilation unit, you are paying a non-negligible cost even if you don’t use anything from it: many standard library headers take hundreds of milliseconds to compile, and if thousands of cpps include them, this adds up enormously. C++20 modules are making their way into compilers, but large codebases are going to have a hard time migrating.
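
As an illustration of the kind of manual mitigation this forces on us, here is a sketch (with hypothetical Renderer/Texture/CommandBuffer names, not from any real codebase) of keeping a widely-included header lightweight through forward declarations and a pimpl, so only one cpp pays for the heavy includes:

```cpp
// renderer.h -- included by many files, so it must stay cheap to parse
#pragma once
#include <cstdint>

class Texture;        // forward declarations instead of #include "texture.h"
class CommandBuffer;  // ...and instead of #include "command_buffer.h"

class Renderer
{
public:
    void Submit(CommandBuffer& commands);
    Texture* FindTexture(uint64_t hash);

private:
    struct Impl;      // pimpl: private members (and the headers they need) live in the cpp
    Impl* impl = nullptr;
};

// renderer.cpp -- only this translation unit pays for the expensive headers
// #include "renderer.h"
// #include "texture.h"
// #include "command_buffer.h"
// #include <unordered_map>   // hundreds of milliseconds of parsing, paid once here
```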

There is a constant tension between convenience and compile times. I worked on a codebase where all rendering headers were put inside “render_api.h” and code from other teams included it. It was very simple to set up, but any time I touched a rendering header, it recompiled the entire codebase due to transitive inclusion. Breaking the header apart took a long time, whereas putting it there in the first place took no effort. Small actions can have large consequences, and the language has not provided a solution for decades.

A Template to Confusion

In The Sorcerer’s Apprentice, the protagonist is tired of trudging along carrying water when he has the idea to leverage his master’s magic to do it for him. His lack of experience backfires as things spiral out of control, because he cannot remember how to undo the spell. In a similar fashion, C++ template and macro magic promises to help with many problems but can cause lots of headaches later on. I’ve seen the allure: fighting the compiler hard to get it to do something specific gives a sense of accomplishment. There might also be an element of sunk cost to it. In any case, overusing it is trivial and undoing it is not.

One codebase had a powerful shader reflection facility with lots of template and macro metahackery, which worked if you left it alone but was incredibly hard to debug, modify and extend. In hindsight, an alternative like code generation could have worked better.

There are valid reasons to use templates and macros, but they are often misused and come with a lot of downsides:

  • Template or macro code beyond the basics is difficult to read and debug
  • Error messages are difficult to understand unless you’re experienced
  • Template resolution rules are hard to memorize and predict, SFINAE is very complicated
  • Templates are slow to compile (see rule of Chiel)
  • Template rules are enforced differently in different compilers, adding complexity
  • The STL’s usage of templates is pervasive; it is essentially a non-trivial language on top of C++
I plead guilty to using many templates in my own math library. To prove the point, however, when I removed them the library (paradoxically) ended up with less code, compiled faster and was functionally the same. I now avoid templates in a first approach to problems. There is a route to enlightenment where you discover the magic, abuse it, then reel it back in.
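
A sketch of the kind of simplification I mean, using hypothetical code rather than my actual library: the templated vector buys generality that never gets used, while the concrete version is shorter, compiles faster and is far easier to step through.

```cpp
// Before: templated over element type and size, even though only float vectors
// of size 2-4 were ever instantiated.
template <typename T, int N>
struct Vector
{
    T v[N];
    Vector operator+(const Vector& o) const
    {
        Vector r;
        for (int i = 0; i < N; ++i) r.v[i] = v[i] + o.v[i];
        return r;
    }
};

// After: write out the handful of types that are actually needed. Less code overall,
// no template instantiation cost in every including cpp, trivial to debug.
struct float3
{
    float x, y, z;
    float3 operator+(const float3& o) const { return { x + o.x, y + o.y, z + o.z }; }
};
```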

Unfortunately, there is an asymmetry here: liberal use of templates does a lot of damage to large codebases, and abstaining from them elsewhere doesn’t undo it; adding headers that include template code is also far easier than removing them.

Death By a Thousand Shades

As graphics programmers, shaders are our bread and butter. Here are these relatively small programs that run millions of times in parallel on the GPU to produce pretty images on screen. The shader languages we use to write them are very simple in nature; all the code is inlined, recursion is not allowed, templates are very recent, so you’d think it wouldn’t be much of a pipeline issue. However, reality is always more complicated, because there are so many shaders!

Shader philosophies differ along two axes that I’ll call responsibility and usage. Responsibility refers to who has access to authoring shaders: it can be just graphics programmers, or technical artists, or general artists and even designers; generally, the more people you have making shaders, the more you’ll have to compile. The usage axis refers to where in the frame these shaders run; for example, shaders that describe material properties typically run during a geometry phase, and lighting or post effect shaders tend to run at the end of the frame in a fullscreen pass. The geometry or material phase is typically where most variation comes from, as you’ll have shaders doing opaque surfaces, cloth, hair, skin, transparency, etc. These variations exist in many places of the frame: for example, depth-only shaders for shadows or depth prepass, transparency shaders, variations for using lightmaps, etc. If I had to guess, I’d say that 95% of shaders in a complex game fall into this category. Since the shaders that artists modify are in this material category, we encounter the so-called combinatorial shader explosion.
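
To make the combinatorics concrete, here is a toy back-of-the-envelope calculation; the feature names and counts are invented, not taken from any particular engine.

```cpp
// Every independent feature toggle doubles the number of variants of a single
// material shader; multiply by material count and platforms and the total explodes.
// Real engines prune combinations that are never used, but the growth is still combinatorial.
#include <cstdio>

int main()
{
    const char* features[] = {
        "SKINNED", "LIGHTMAPPED", "ALPHA_TESTED", "PARALLAX",
        "DETAIL_MAP", "DEPTH_ONLY", "INSTANCED", "FORWARD_PLUS",
    };
    const long long featureCount = sizeof(features) / sizeof(features[0]);

    const long long variantsPerMaterial = 1LL << featureCount; // 2^8 = 256
    const long long materials = 500;                           // authored by artists
    const long long platforms = 4;

    std::printf("%lld potential shader compiles\n", variantsPerMaterial * materials * platforms);
    return 0;
}
```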

Luciano Jacomeli - Compiling Shader 4.24

To put into perspective the extent to which these philosophies vary, consider that Doom 2016 keeps tight control over shaders, where only graphics programmers can modify them, and apparently needs only a few hundred shaders, whereas on the other side a typical Unreal Engine project probably ships with 10,000 shaders. Within these extremes, I’ve worked at places where only programmers could create and modify shaders, places where only technical artists could create the material shaders, and places where any artist could create them. The compile times for a full rebuild of shaders vary wildly across these setups.

This of course is only part of the problem, as shader compilation (at least on PC/mobile) is two-staged: the compilation that happens on the devs’ machines, and the one that happens on the users’ machines. The first step takes the textual shader, written either manually by a programmer or through a node graph tool, and produces a temporary, optimized common representation of the shader. This then needs to be translated into vendor-specific shader instructions (such as AMD’s RDNA ISA) during PSO compilation on the player’s machine, hence the #stutterstruggle that has become somewhat of a meme in PC gaming lately.
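
As a rough sketch of those two stages on PC, using the D3D toolchain as an example (a sketch of mine; the engines above may use different compilers and caching schemes):

```cpp
#include <windows.h>
#include <d3dcompiler.h>   // link against d3dcompiler.lib
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Stage 1: developer side (or build farm). HLSL text -> portable bytecode,
// which is what actually ships with the game.
ComPtr<ID3DBlob> CompileOffline(const char* hlslSource, size_t sourceLength)
{
    ComPtr<ID3DBlob> bytecode, errors;
    HRESULT hr = D3DCompile(hlslSource, sourceLength, "lit_opaque.hlsl",
                            nullptr, nullptr,        // no defines, no include handler
                            "PSMain", "ps_5_0",
                            0, 0, &bytecode, &errors);
    if (FAILED(hr) && errors)
        OutputDebugStringA(static_cast<const char*>(errors->GetBufferPointer()));
    return bytecode;
}

// Stage 2 happens on the player's machine: the driver translates this bytecode into the
// vendor ISA when the pipeline state object is created (e.g. through
// ID3D12Device::CreateGraphicsPipelineState). Doing that lazily at draw time is what
// causes the PSO stutter, which is why engines precompile and cache pipelines up front.
```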

One place I worked at had an interesting shader compilation philosophy: when artists saved the material they were authoring, it would be compiled for all platforms and the binaries would get uploaded to version control. This had the really nice property that nobody else had to compile that shader on their machine, and the disadvantage that making sweeping changes to the shaders became difficult.

A Heap of Trouble

Another big source of problems in games is heap allocation. Because games are so dynamic, they tend to spawn and destroy things left and right, be it particles, debris from destruction, short-lived sounds, network packets, etc. In rendering specifically, every frame we prepare and discard thousands of rendering commands, short-lived vertices or per-frame constant buffers. The workload is so volatile that if we mainly used heap allocations for these things, we would encounter:

  • Fragmentation: running out of usable memory because the heap is full of holes, with small allocations splitting larger blocks apart
  • Contention/Unpredictability: threads will block each other for shared resources, hitching at unpredictable times
  • Cache: a new allocation is essentially a cache miss

A large part of optimization really boils down to avoiding heap allocations. Instead, games reserve large blocks at boot time subdivided by resource type, create object pools to reuse memory, use arena allocators, or just use the stack, among other techniques. If you e.g. need scratch memory for sorting, allocating on the stack is essentially free and trivial to dispose of. Containers, structures and allocators that encourage this memory pattern are essential. If you’re interested in memory allocation strategies I’d recommend giving this article a read too.
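
As an example of what these techniques look like in practice, here is a minimal sketch of a per-frame linear (arena) allocator; it is my own simplified, single-threaded version, not taken from any particular engine. Everything allocated during the frame is released with a single reset, so there is no fragmentation and no per-allocation bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

class FrameArena
{
public:
    explicit FrameArena(size_t capacity)
        : m_base(static_cast<uint8_t*>(std::malloc(capacity))), m_capacity(capacity) {}
    ~FrameArena() { std::free(m_base); }

    // Bump the offset; no headers, no free list, no lock in this simplified version.
    void* Allocate(size_t size, size_t alignment = 16)
    {
        size_t aligned = (m_offset + alignment - 1) & ~(alignment - 1);
        assert(aligned + size <= m_capacity && "frame arena exhausted");
        m_offset = aligned + size;
        return m_base + aligned;
    }

    // Called once per frame, when the data is no longer needed.
    void Reset() { m_offset = 0; }

private:
    uint8_t* m_base     = nullptr;
    size_t   m_capacity = 0;
    size_t   m_offset   = 0;
};

// Usage sketch: carve per-frame scratch such as sort keys or constant data.
// FrameArena arena(8 * 1024 * 1024);
// auto* keys = static_cast<uint32_t*>(arena.Allocate(drawCount * sizeof(uint32_t)));
// ...build and submit the frame...
// arena.Reset();
```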

Years ago I worked on Android games written in Java. Because the language doesn’t provide value types, even vector math and string processing would keep the heap/garbage collector active and hitching all the time. We resorted to global StringBuffers and Vectors for intermediate calculations, a very cumbersome and error-prone use of the language.

Continue reading

Temporal AA and the quest for the Holy Trail

Long gone are the times when Temporal AA was a novel technique, and more articles slowly appear covering motivations, implementations and solutions. I will throw my programming hat into the ring to walk through it, almost like a tutorial, for the future me and for anyone interested. I am using Matt Pettineo’s MSAAFilter demo to show the different stages. The contents come mostly from the invaluable work of many talented developers, and a little from my own experience. I will introduce a couple of tricks I have come across that I haven’t seen in papers or presentations.

Sources of aliasing

The sources of aliasing in CG images vary wildly. Geometric (edge) aliasing, alpha testing, specular highlights, high frequency normals, parallax mapping, low resolution effects (SSAO, SSR), dithering and noise all conspire to destroy our visuals. Some solutions, like hardware MSAA and screen space edge detection techniques, work for a subset of cases but fail in different ways. Temporal techniques attempt to achieve supersampling by distributing the computations across multiple frames, while addressing all forms of aliasing. This stabilizes the image but also creates some challenging artifacts.

Jitter

The main principle of TAA is to compute multiple sub-pixel samples across frames, then combine them into a single final pixel. The simplest scheme generates random samples within the pixel, but there are better ways of producing fixed sequences of samples. A short overview of quasi-random sequences can be found here. It is important to select a good sequence to avoid clumping, and a small, fixed number of samples within the sequence: typically 4-8 work well. In practice this is more important for a static image than a dynamic one. Below, a pixel with 4 samples.

To produce random sub-samples within a pixel we translate the projection matrix by a fraction of a pixel along the frustum plane. The valid range for the jitter offset (relative to the pixel center) is half the inverse of the screen dimension in pixels, so \left[\dfrac{-1}{2w},\dfrac{1}{2w}\right] and \left[\dfrac{-1}{2h},\dfrac{1}{2h}\right]. We multiply the offset matrix (just a normal translation matrix) by the projection matrix to get the modified projection, as shown below.

 

\[
\begin{pmatrix}
\dfrac{2n}{w} & 0 & 0 & 0\\
0 & \dfrac{2n}{h} & 0 & 0\\
0 & 0 & \dfrac{f}{f-n} & 1\\
0 & 0 & \dfrac{-f \cdot n}{f-n} & 0
\end{pmatrix}
\cdot
\begin{pmatrix}
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0\\
j_x & j_y & 0 & 1
\end{pmatrix}
=
\begin{pmatrix}
\dfrac{2n}{w} & 0 & 0 & 0\\
0 & \dfrac{2n}{h} & 0 & 0\\
j_x & j_y & \dfrac{f}{f-n} & 1\\
0 & 0 & \dfrac{-f \cdot n}{f-n} & 0
\end{pmatrix}
\]

 

Once we have a set of samples, we use this matrix to rasterize geometry as normal to produce the image that corresponds to the sample. If it all works well and every frame you get a new jitter, the image should look wobbly like this.
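
A sketch of that setup in code, following the row-vector convention of the matrices above; it is my own illustration using a Halton(2,3) sequence, one of several sequences that work well here.

```cpp
#include <cmath>

// Halton sequence: a commonly used low-discrepancy sequence for TAA jitter.
float Halton(int index, int base)
{
    float result = 0.0f, f = 1.0f;
    while (index > 0)
    {
        f /= base;
        result += f * (index % base);
        index /= base;
    }
    return result;
}

struct Matrix4 { float m[4][4]; };

void ApplyJitter(Matrix4& proj, int frameIndex, int width, int height)
{
    const int sampleCount = 8;                    // cycle through 8 samples of the sequence
    const int i = (frameIndex % sampleCount) + 1; // Halton is usually started at index 1

    // Remap the [0,1) samples to the jitter range derived above; some implementations
    // scale this differently to span the full pixel.
    const float jx = (Halton(i, 2) - 0.5f) / width;
    const float jy = (Halton(i, 3) - 0.5f) / height;

    // Equivalent to multiplying by the translation matrix in the equation above.
    proj.m[2][0] += jx;
    proj.m[2][1] += jy;
}
```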

Continue reading

The Rendering of Mafia: Definitive Edition

Mafia: Definitive Edition (2020) is a remake of the much-loved gangster classic Mafia (2002), originally released for PS2 and Xbox. The game is relatively linear and very story focused; I personally found its narrative gripping and worthy of being compared to Scarface or Goodfellas. Hangar 13 uses their own technology for open worlds and stories, previously used for Mafia III, to bring Tommy and the Salieri family to life. It is a DX11 deferred engine on PC, and RenderDoc 1.13 was used to capture and analyze the frame.

The Frame

Tommy looks like he means business with his jacket and fedora, and thus our frame analysis begins. I chose a nighttime city scene as I find it more moody and challenging to get right. Let’s dive right in: I’ll make you a rendering offer you can’t refuse.

Depth Prepass

As we know, a depth prepass is often a careful balance between the time you spend doing it and the time you save through more effective occlusion. Objects seem to be relatively well selected and sorted by depth and size, as by drawcall 120 we already have a lot of the biggest content in the depth buffer, rendered with very simple shaders. Subsequent drawcalls often fail the depth test after that, avoiding wasted work. There are some odd choices, like the electricity wires, which I assume have large bounding boxes, but most of it makes sense and probably costs little compared to what it saves.

GBuffer Pass

The GBuffer for Mafia packs quite a lot of information. The first texture contains normals and roughness, which is quite standard these days, in 16-bit floating point. While it’s a little large for my taste, normals tend to want as much bit-depth as possible, especially if no compression schemes are used.

R16F       G16F       B16F       A16F
Normal.x   Normal.y   Normal.z   Roughness

GBuffer Normals
GBuffer Roughness
 

 

The second texture contains albedo and metalness in an 8-bit normalized format, which is also common for PBR engines and relevant here since cars sport very reflective chrome components. As you can see, metallic parts are marked as white whereas almost everything else is black (i.e. non-metal).

R8         G8         B8         A8
Albedo.r   Albedo.g   Albedo.b   Metalness

GBuffer Albedo
GBuffer Metalness
 

 

The next texture contains packed quantities that are not easy to decode by inspection. RenderDoc has a neat feature, custom shaders, that will come to our aid. Searching the capture we come across the code for decoding these channels, and after adapting the D3D bytecode back to HLSL, displaying them on screen actually starts to make sense. The first 3 channels are motion vectors (including a z component, which I find interesting), and the last channel is the vertex normal encoded in two 8-bit values (z is implicit). It’s interesting to note that vertex normals have only been given 2 bytes, as opposed to the 6 bytes assigned to per-pixel normals. Vertex normals are an unusual thing to output, but we’ll soon find out why.

R16U             G16U             B16U             A16U
MotionVector.x   MotionVector.y   MotionVector.z   Encoded Vertex Normal

GBuffer Encoded Motion Vectors
GBuffer Decoded Motion Vectors
GBuffer Encoded Vertex Normal
GBuffer Decoded Vertex Normal
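
As an aside, here is one plausible way such a two-channel vertex normal could be unpacked, with z reconstructed from unit length; this is purely illustrative, and the exact encoding Mafia uses may well be different (octahedral, view space with a stored sign, etc.).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Float3 { float x, y, z; };

Float3 DecodeVertexNormal(uint16_t packedChannel)
{
    // Split the 16-bit channel into two 8-bit values and remap [0,255] -> [-1,1].
    const float nx = ((packedChannel        & 0xFF) / 255.0f) * 2.0f - 1.0f;
    const float ny = (((packedChannel >> 8) & 0xFF) / 255.0f) * 2.0f - 1.0f;

    // Reconstruct z assuming a unit-length normal; the sign of z is lost with this
    // scheme unless it is stored elsewhere or the normal is known to face the camera.
    const float nz = std::sqrt(std::max(0.0f, 1.0f - nx * nx - ny * ny));
    return { nx, ny, nz };
}
```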
 

 

The fourth texture contains miscellaneous quantities such as specular intensity, curvature and profile for subsurface scattering, and flags. The G component is set to 0.5, so it may be an unused/spare channel reserved for future usage.

R8                   G8    B8                                 A8
Specular Intensity   0.5   Curvature or Thickness (for SSS)   SSS Profile

GBuffer Specular Intensity
GBuffer Curvature
GBuffer SSS Profile
 

 

The last entry in the GBuffer is the emissive lighting, which becomes the main lighting buffer from now on.

R11F         G11F         B10F
Emissive.r   Emissive.g   Emissive.b

One interesting performance decision for the GBuffer is not clearing it at the start of the frame. Sometimes clearing a buffer is necessary, but you can avoid the cost if you’re going to overwrite the contents and you know where (by marking it in the stencil). There are other performance penalties involved in clearing depending on platform, so the gist of it is it’s never a bad idea to avoid clearing if you can.

Continue reading

A Macro View of Nanite

After showing an impressive demo last year and being unleashed recently with the UE5 preview, Nanite is all the rage these days. I just had to go in, have some fun trying to figure it out, and explain how I think it operates and the technical decisions behind it using a RenderDoc capture. Props to Epic for being open with their tech, which makes it easier to learn and pick apart; the editor has markers and debug information that are going to be super helpful.

This is the frame we’re going to be looking at, from the epic showdown in the Valley of the Ancient demo project. It shows the interaction between Nanite and non-Nanite geometry and it’s just plain badass.

Nanite::CullRasterize

The first stage in this process is Nanite::CullRasterize, and it looks like this. In a nutshell, this entire pass is responsible for culling instances and triangles and rasterizing them. We’ll refer to it as we go through the capture.

Instance Culling

Instance culling is one of the first things that happens here. It looks to be a GPU form of frustum and occlusion culling. There are instance data and primitive data bound here; I guess that means it culls at the instance level first, and if the instance survives it starts culling at a finer-grained level. The Nanite.Views buffer provides camera info for frustum culling, and a hierarchical depth buffer (HZB) is used for occlusion culling. The HZB is sourced from the previous frame and forward-projected to this one. I’m not sure how it deals with dynamic objects; it may be that it uses such a large mip (small resolution) that it is conservative enough. EDIT: According to the Nanite paper, the HZB is generated this frame from the previous frame’s visible objects. The HZB is tested against the previous objects as well as anything new, and visibility is updated for the next frame.

Both visible and non-visible instances are written into buffers. For the latter I’m thinking this is the way of doing what occlusion queries used to do in the standard mesh pipeline: inform the CPU that a certain entity is occluded and it should stop processing until it becomes visible. The visible instances are also written out into a list of candidates.

Persistent Culling

Persistent culling seems to be related to streaming. It runs a fixed number of compute threads, suggesting it is unrelated to the complexity of the scene and instead maybe checks some spatial structure for occlusion. This is one complicated shader, but based on the inputs and outputs we can see it writes out how many triangle clusters of each type (compute and traditional raster) are visible into a buffer called MainRasterizeArgsSWHW (SW: compute, HW: raster).

Clustering and LODding

It’s worth mentioning LODs at this point as it is probably around here where those decisions are made. Some people speculated geometry images as a way to do continuous LODding but I see no indication of this. Triangles are grouped into patches called clusters, and some amount of culling is done at the cluster level. The clustering technique has been described before in papers by Ubisoft and Frostbite. For LODs, clusters start appearing and disappearing as the level of detail descends within instances. Some very clever magical incantations are employed here that ensure all the combinations of clusters stitch into each other seamlessly.

Continue reading

The Rendering of Jurassic World: Evolution

Jurassic World: Evolution is the kind of game many kids (and adult-kids) dreamed of for a long time. What’s not to like about a game that gives you the reins of a park where the main attractions are 65-million-year-old colossal beasts? This isn’t the first successful amusement park game by Frontier Developments, but it’s certainly not your typical one. Frontier is a proud developer of their Cobra technology, which has been evolving since 1988. For JWE in particular it is a DX11 tiled forward renderer. For the analysis I used Renderdoc and turned on all the graphics bells and whistles. Welcome… to Jurassic Park.

The Frame

It’s hard to decide what to present as a frame for this game, because free navigation and a dynamic time of day mean you have limitless possibilities. I chose a moody, rainy intermediate view that captures the dark essence of the original movies, taking advantage of the Capture Mode.

Compute Shaders

The first thing to notice about the frame is that it is very compute-heavy. In the absence of markers, Renderdoc splits rendering into passes if there is more than one Draw or Dispatch command targeting the same output buffers. According to the capture there are 15 compute vs 18 color/depth passes, i.e. it is broadly split into half compute, half draw techniques. Compute can be more flexible than draw (and, if done correctly, faster), but a lot of time has to be spent fine-tuning and balancing performance. Frontier clearly spared no expense developing the technology to get there; however, this also means that analyzing a frame is a bit harder.

Grass Displacement

A big component of JWE is foliage and its interaction with cars, dinosaurs, wind, etc. To animate the grass, one of the very first processes populates a top-down texture that contains grass displacement information. This grass displacement texture is later read in the vertex shader of all the grass in the game, and the information is used to modify the position of the vertices of each blade of grass. The texture wraps around as the camera moves and fills in the new regions that appear at the edges. This means that the texture doesn’t necessarily look like a top-down snapshot of the scene, but will typically be split into 4 quadrants (there is a sketch of this wrap-around addressing right after the list below). The process involves these steps:

  1. Render dinosaurs and cars, and probably other objects such as the gyrospheres. This doesn’t need an accurate version of the geometry; e.g. cars only render the wheels and the part of the chassis that is in contact with grass. The result is a top-down depth buffer (leftmost image). If you squint you’ll see the profile of an ankylosaurus. The other dinosaurs aren’t rendered here; perhaps the engine knows they aren’t stepping on grass and optimizes them out.
  2. Take this depth buffer and a heightmap of the scene (center image), and output three quantities: a mask telling whether the depth of the object was above or below the terrain, the difference in depth between them, and the actual depth, packed into a 3-channel texture (rightmost image)
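
A sketch of the wrap-around addressing mentioned above; the function, names and numbers are mine, not Frontier’s. The texture covers a fixed-size window of the world, and world positions are simply wrapped into texel space as the camera moves, so only newly revealed rows and columns need redrawing, which is also why the content ends up split into quadrants rather than looking like a clean top-down snapshot.

```cpp
#include <cmath>

struct Texel { int x, y; };

// Map a world-space position to a texel of the displacement map, wrapping toroidally.
Texel WorldToDisplacementTexel(float worldX, float worldZ,
                               float metersCovered, int textureSize)
{
    // World position expressed in texel units of the displacement map.
    const float u = worldX / metersCovered * textureSize;
    const float v = worldZ / metersCovered * textureSize;

    // Wrap instead of recentering, so existing texels keep their place as the camera moves.
    auto wrap = [textureSize](float t) {
        int i = static_cast<int>(std::floor(t)) % textureSize;
        return i < 0 ? i + textureSize : i;
    };
    return { wrap(u), wrap(v) };
}
```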



An additional process simulates wind. In this particular scene there is a general breeze from the storm plus a helicopter, both producing currents that displace grass. This is a top-down texture similar to the one before, containing motion vectors in 2D. The motion for the wind is an undulating texture meant to mimic wind waves, which seems to have been computed on the CPU, and the influence of the helicopter is cleverly done by blending a stream of particles on top of the first texture. You can see it in the image as streams pulling outward. Dinosaur and car motion is also blended here. I’m not entirely sure what the purpose of the repeating texture is (you can see the same objects repeated multiple times).

Continue reading

Rendering Line Lights

Within the arsenal of lights provided by game engines, the most popular are punctual lights such as point, spot or directional, because they are cheap. On the other end, area lights have recently given rise to incredible techniques such as Linearly Transformed Cosines and other analytic approximations. I want to talk about the line light.

Update [04/09/2020]: When I originally wrote the article there were no public images showing Jedi or lightsabers, so I couldn’t make the connection (though a clever reader could have concluded what they might be for!). I can finally show this work off as it’s meant to be. You can also watch a gameplay trailer here.

In Unreal Engine 4, modifying ‘Source Length’ on a point light elongates it as described in this paper. It spreads the intensity along the length so a longer light becomes perceptually dimmer. Frostbite also has tube lights, a complex implementation of the analytical illuminance emitted by a cylinder and two spheres. Unity includes tube lights as well in their HD Render Pipeline (thanks Eric Heitz and Evegenii Golubev for pointing it out) based on their LTC theory, which you can find a great explanation and demos for here. Guerrilla Games’ Decima Engine has elongated quad lights using an approach for which they have a very attractive and thorough explanation in GPU Pro 5’s chapter II.1, Physically Based Area Lights. This is what I adapted to line lights.

Continue reading

The Rendering of Rise of the Tomb Raider

Rise of the Tomb Raider (2015) is the sequel to the excellent Tomb Raider (2013) reboot. I personally find both refreshing as they move away from the stagnating original series and retell the Croft story. The game is story focused and, like its prequel, offers enjoyable crafting, hunting and climbing/exploring mechanics.

Tomb Raider used the Crystal Engine, developed by Crystal Dynamics and also used in Deus Ex: Human Revolution. For the sequel a new engine called Foundation was used, previously developed for Lara Croft and the Temple of Osiris (2014). Its rendering can be broadly classified as a tiled light-prepass engine, and we’ll see what that means as we dive in. The engine offers the choice between a DX11 and a DX12 renderer; I chose the latter for reasons we’ll see later. I used Renderdoc 1.2 to capture the frame on a Geforce 980 Ti, and turned on all the bells and whistles.

The Frame

I can safely say without spoilers that in this frame bad guys chase Lara because she’s looking for an artifact they’re looking for too, a conflict of interest that absolutely must be resolved using weapons. Lara is inside the enemy base at nighttime. I chose a frame with atmospheric and contrasty lighting where the engine can show off.

Depth Prepass

A customary optimization in many games, a small depth prepass takes place here (~100 draw calls). The game renders the biggest objects (rather, the ones that take up the most screen space) to take advantage of the Early-Z capability of GPUs. A concise article by Intel explains further. In short, the GPU can avoid running a pixel shader if it can determine it’s occluded behind a previous pixel. It’s a relatively cheap pass that pre-populates the Z-buffer with depth.

An interesting thing I found is a level of detail (LOD) technique called ‘fizzle’ or ‘checkerboard’. It’s a common way to fade objects in and out at a distance, either to later replace them with a lower quality mesh or to make them disappear completely. Take a look at this truck. It seems to be rendering twice, but in reality it’s rendering a high LOD and a low LOD at the same position, each rendering to the pixels the other is not rendering to. The first LOD is 182226 vertices, whereas the second LOD is 47250. They’re visually indistinguishable at a distance, and yet one is almost 4 times cheaper. In this frame, LOD 0 has almost disappeared while LOD 1 is almost fully rendered. Once LOD 0 completely disappears, only LOD 1 will render.

A pseudorandom texture and a probability factor allow us to discard pixels that don’t pass a threshold. You can see this texture used in ROTR. You might be asking yourself why not use alpha blending instead. There are many disadvantages to alpha blending compared to fizzle fading.

  1. Depth prepass-friendly: By rendering it like an opaque object and puncturing holes, we can still render into the prepass and take advantage of early-z. Alpha blended objects don’t render into the depth buffer this early due to sorting issues.
  2. Needs extra shader(s): If you have a deferred renderer, your opaque shader doesn’t do any lighting. You need a separate variant that does if you’re going to swap an opaque object for a transparent one. Aside from the memory/complexity cost of having at least an extra shader for all opaque objects, they need to be accurate to avoid popping. There are many reasons why this is hard, but it boils down to the fact they’re now rendering through a different code path.
  3. More overdraw: Alpha blending can produce more overdraw and depending on the complexity of your objects you might find yourself paying a large bandwidth cost for LOD fading.
  4. Z-fighting: z-fighting is the flickering effect when two polygons render to a very similar depth such that floating point imprecision causes them to “take turns” to render. If we render two consecutive LODs by fading one out and the next one in, they might z-fight since they’re so close together. There are ways around it like biasing one over the other but it gets tricky.
  5. Z-buffer effects: Many effects like SSAO rely on the depth buffer. If we render transparent objects at the end of the pipeline when ambient occlusion has run already, we won’t be able to factor them in.

One disadvantage of this technique is that it can look worse than alpha fading, but a good noise pattern, post-fizzle blurring or temporal AA can hide it to a large extent. ROTR doesn’t do anything fancy in this respect.
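
For reference, the thresholding logic itself is tiny. Here is a CPU-side sketch of it; the real version lives in the pixel shader and uses clip()/discard, and the parameter names are mine, not Crystal Dynamics’.

```cpp
#include <cstdint>

// Returns true if this pixel of the LOD should be kept, false if it should be punched out.
// fadeIn goes 0 -> 1 as a LOD fades in; the outgoing LOD uses (1 - fadeIn) so the two
// LODs cover complementary sets of pixels and never blend.
bool KeepPixel(uint32_t px, uint32_t py, float fadeIn,
               const float* noiseTexture, uint32_t noiseSize)
{
    const float noise = noiseTexture[(py % noiseSize) * noiseSize + (px % noiseSize)]; // [0,1)
    return noise < fadeIn;   // in HLSL this would be clip(fadeIn - noise)
}
```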

Normals Pass

Crystal Dynamics uses a relatively unusual lighting scheme for its games that we’ll describe in the lighting pass. For now, suffice it to say that there is no G-Buffer pass, at least not in the sense that other games have accustomed us to. Instead, the objects in this pass only output depth and normals information. Normals are written to an RGBA16_SNORM render target in world space. As a curiosity, this engine uses Z-up as opposed to Y-up, which is what I see more often in other engines/modelling packages. The alpha channel contains glossiness, which will be decompressed later as exp2(glossiness * 12 + 1.0). The glossiness value can actually be negative, as the sign is used as a flag to indicate whether a surface is metallic or not. You can almost spot it yourself, as the darker colors in the alpha channel are all metallic objects.

R          G          B          A
Normal.x   Normal.y   Normal.z   Glossiness + Metalness

Normals
Glossiness/Metalness
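
A small sketch of how such a packing could be read back; the sign/magnitude split below is my reading of the capture, and the actual shader may differ.

```cpp
#include <cmath>

struct SurfaceParams { float specularPower; bool metallic; };

SurfaceParams DecodeGlossMetal(float encoded) // alpha channel of the normals target
{
    SurfaceParams p;
    p.metallic      = (encoded < 0.0f);              // the sign doubles as a metalness flag
    const float g   = std::fabs(encoded);            // assumption: magnitude carries glossiness
    p.specularPower = std::exp2(g * 12.0f + 1.0f);   // decompression quoted above
    return p;
}
```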
 

Continue reading

A Real Life Pinhole Camera

When I married last year, my wife and I went on our honeymoon to Thailand. Their king Bhumibol had died a month earlier and the country was mourning. Everywhere we found good wishes and memorials, and people would dress in black and white as a sign of sorrow. The Thai are a gentle and polite people who like to help out; we’d ask for directions and people with little notion of English would spend twenty minutes understanding and answering our questions. Thailand has a rich history of rising and falling kingdoms, great kings and battles, and unification and invasions by foreign kingdoms. There are some amazing ruins of these kingdoms. Thailand also lives by a variant of Buddhism reflected in their beautiful temples. Some of the architectural features I found most interesting are the small reflective tiles that cover the outer walls, animal motifs like the Garuda (bird creatures that can be seen on the rooftops), and snake-like creatures called Naga. It is in this unexpected context that I found a real-life pinhole camera. I always wear my graphics hat, so I decided to capture it and later make a post.

First, a little background. A pinhole camera (also known as camera obscura, after its Latin name) is essentially the simplest camera you can come up with. If you conceptually imagine a closed box with a single, minuscule hole in one of its faces, such that only a single ray from each direction can come inside, you’d get a mirrored image on the inner face opposite the pinhole. An image is worth more than a thousand explanations, so here’s what I’m talking about.

 

Pinhole Diagram
Pinhole Diagram
Pinhole Diagram
Pinhole Diagram
 

 

As you can see, the concept is simple. If you were inside the room, you’d see an inverted image of the outside. The hole is so small that the room would be fairly dark, so even the faint light bouncing back towards you would still be visible. I made the pinhole a hexagon, as I wanted to suggest the fact that it is effectively the shutter of a modern camera. Louis Daguerre, one of the fathers of photography, used this model in his famous daguerreotype circa 1835, but Leonardo da Vinci had already described the phenomenon as an oculus artificialis (artificial eye) in one of his works as early as 1502. There are plenty of additional resources if you’re interested, and even a pretty cool tutorial on how to create your own.

Now that we understand what this camera is, let’s look at the real image I encountered. I’ve aligned the inside and outside images I took and cast rays so you can see what I mean.

 

Real Pinhole Camera
Real Pinhole Camera
Real Pinhole Camera
Real Pinhole Camera
Real Pinhole Camera
 

 

The image of the inside looks bright, but I had to take it with 1 second of exposure and it still looks relatively dark. On top of that, the day outside was very sunny, which helped a lot in getting a clear “photograph”.

The Rendering of Middle Earth: Shadow of Mordor

Middle Earth: Shadow of Mordor was released in 2014. The game itself was a great surprise, and the fact that it was a spin-off within the storyline of the Lord of the Rings universe was quite unusual and something I enjoyed. The game was a great success, and at the time of writing, Monolith has already released the sequel, Shadow of War. The game’s graphics are beautiful, especially considering it was a cross-generation game also released on Xbox 360 and PS3. The PC version is quite polished and features a few extra graphical options and high-resolution texture packs that make it shine.

The game uses a relatively modern deferred DX11 renderer. I used Renderdoc to delve into the game’s rendering techniques. I used the highest possible graphical settings (ultra) and enabled all the bells and whistles like order-independent transparency, tessellation, screen-space occlusion and the different motion blurs.

The Frame

This is the frame we’ll be analyzing. We’re at the top of a wooden scaffolding in the Udun region. Shadow of Mordor has similar mechanics to games like Assassin’s Creed where you can climb buildings and towers and enjoy some beautiful digital scenery from them.

Depth Prepass

The first ~140 draw calls perform a quick prepass to render the biggest elements of the terrain and buildings into the depth buffer. Most things don’t end up appearing in this prepass, but it helps when you’ve got a very big number of draw calls and a far range of view. Interestingly, the character, who is always in front and takes up a decent amount of screen space, does not go into the prepass. As is common for many open world games, the game employs reverse-z, a technique that maps the near plane to 1.0 and the far plane to 0.0 for increased precision at great distances and to prevent z-fighting. You can read more about z-buffer precision here.
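
For illustration, this is what a reverse-Z setup generally looks like; it is a generic sketch in the row-vector matrix convention used elsewhere on this blog, not Monolith’s actual code. Besides the projection, the depth test flips to GREATER and the depth buffer is cleared to 0.0 instead of 1.0.

```cpp
#include <d3d11.h>

struct Matrix4 { float m[4][4]; };

// n = near plane, f = far plane, w/h = frustum width/height at the near plane.
Matrix4 MakeReverseZPerspective(float n, float f, float w, float h)
{
    Matrix4 p = {};
    p.m[0][0] = 2.0f * n / w;
    p.m[1][1] = 2.0f * n / h;
    p.m[2][2] = n / (n - f);       // z = near maps to depth 1.0...
    p.m[2][3] = 1.0f;
    p.m[3][2] = -f * n / (n - f);  // ...and z = far maps to depth 0.0
    return p;
}

void SetupReverseZDepthState(D3D11_DEPTH_STENCIL_DESC& desc)
{
    desc.DepthEnable    = TRUE;
    desc.DepthWriteMask = D3D11_DEPTH_WRITE_MASK_ALL;
    desc.DepthFunc      = D3D11_COMPARISON_GREATER; // "closer than" becomes "greater than"
}
```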

 

G-buffer

Right after that, the G-Buffer pass begins, with around 2700 draw calls. If you’ve read my previous analysis of Castlevania: Lords of Shadow 2 or other similar articles, you’ll be familiar with this pass. Surface properties are written to a set of buffers that are read later on by lighting passes to compute their response to light. Shadow of Mordor uses a classical deferred renderer, but with a comparatively small number of G-Buffer render targets (3) to achieve its objective. Just for comparison, Unreal Engine uses between 5 and 6 buffers in this pass. The G-Buffer layout is as follows:

Normals Buffer
R          G          B          A
Normal.x   Normal.y   Normal.z   ID

The normals buffer stores the normals in world space, in an 8-bit per channel format. This is a little bit tight, sometimes not enough to accurately represent smoothly varying flat surfaces, as can be seen in some puddles throughout the game if you pay close attention. The alpha channel is used as an ID that marks different types of objects. Some that I’ve found correspond to characters (255) and animated plants or flags (128), while the sky is marked with ID 1, which is later used to filter it out during the bloom phase (it gets its own radial bloom).

World Space Normals
Object ID
 

Continue reading

Photoshop Blend Modes Without Backbuffer Copy

For the past couple of weeks, I have been trying to replicate the Photoshop blend modes in Unity. It is no easy task; despite the advances of modern graphics hardware, the blend unit still resists being programmable and will probably remain fixed for some time. Some OpenGL ES extensions implement this functionality, but most hardware and APIs don’t. So what options do we have?

1) Backbuffer copy

A common approach is to copy the entire backbuffer before doing the blending. This is what Unity does. After that it’s trivial to implement any blending you want in shader code. The obvious problem with this approach is that you need to do a full backbuffer copy before you do the blending operation. There are certainly some possible optimizations like only copying what you need to a smaller texture of some sort, but it gets complicated once you have many objects using blend modes. You can also do just a single backbuffer copy and re-use it, but then you can’t stack different blended objects on top of each other. In Unity, this is done via a GrabPass. It is the approach used by the Blend Modes plugin.

2) Leveraging the Blend Unit

Modern GPUs have a little unit at the end of the graphics pipeline called the Output Merger. It’s the hardware responsible for taking the output of a pixel shader and blending it with the backbuffer. It’s not programmable, as making it so has quite a lot of complications (you can read about it here), so current GPUs don’t have a programmable blend unit.

The blend mode formulas were obtained here and here. Use them as a reference to compare with what I provide. There are many other sources. One thing I’ve noticed is that the provided formulas often neglect to mention that Photoshop actually uses modified formulas and clamps quantities in a different manner, especially when dealing with alpha. Gimp does the same. This is my experience recreating the Photoshop blend modes exclusively using a combination of the blend unit and shaders. The first few blend modes are simple, but as we progress we’ll have to resort to more and more tricks to get what we want.

Two caveats before we start. First off, Photoshop blend modes do their blending in sRGB space, which means that if you do them in linear space they will look wrong. Generally this isn’t a problem, but due to the amount of trickery we’ll be doing for these blend modes, many of the values need to go beyond the 0 – 1 range, which means we need an HDR buffer to do the calculations. Unity can do this by setting the camera to be HDR in the camera settings, and also setting Gamma as the color space in the Player Settings. This is clearly undesirable if you do your lighting calculations in linear space. In a custom engine you would probably be able to set this up differently (to allow for linear lighting).

If you want to try the code out while you read ahead, download it here.

A) Darken

Formula min(SrcColor, DstColor)
Shader Output
Blend Unit Min(SrcColor · One, DstColor · One)

darken

As alpha approaches 0, we need the minimum to tend towards DstColor, which we do by forcing SrcColor towards the maximum possible color, float3(1, 1, 1).
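
For reference, the same fixed-function configuration expressed as a raw D3D11 blend state; this is my own sketch, since the post itself sets this up through Unity. The pixel shader is expected to push its output towards float3(1, 1, 1) as alpha approaches 0 (e.g. via a lerp), as described above.

```cpp
#include <d3d11.h>

D3D11_BLEND_DESC MakeDarkenBlendDesc()
{
    D3D11_BLEND_DESC desc = {};
    desc.RenderTarget[0].BlendEnable           = TRUE;
    desc.RenderTarget[0].SrcBlend              = D3D11_BLEND_ONE;    // SrcColor * One
    desc.RenderTarget[0].DestBlend             = D3D11_BLEND_ONE;    // DstColor * One
    desc.RenderTarget[0].BlendOp               = D3D11_BLEND_OP_MIN; // Min(...); D3D ignores
                                                                     // the factors for MIN, so
                                                                     // ONE matches the formula
    desc.RenderTarget[0].SrcBlendAlpha         = D3D11_BLEND_ONE;
    desc.RenderTarget[0].DestBlendAlpha        = D3D11_BLEND_ONE;
    desc.RenderTarget[0].BlendOpAlpha          = D3D11_BLEND_OP_MIN;
    desc.RenderTarget[0].RenderTargetWriteMask = D3D11_COLOR_WRITE_ENABLE_ALL;
    return desc;
}
```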

B) Multiply

Formula SrcColor · DstColor
Shader Output
Blend Unit SrcColor · DstColor + DstColor · OneMinusSrcAlpha

multiply

Continue reading