The Rendering of Mafia: Definitive Edition

Mafia: Definitive Edition (2020) is a remake of the much-loved gangster classic Mafia (2002), originally released for PS2 and Xbox. The game is relatively linear and very story focused; I personally found the narrative gripping and worthy of comparison to Scarface or Goodfellas. Hangar 13 uses its own open-world, story-driven technology, previously used for Mafia III, to bring Tommy and the Salieri family to life. On PC it is a DX11 deferred engine, and RenderDoc 1.13 was used to capture and analyze the frame.

The Frame

Tommy looks like he means business with his jacket and fedora, and thus our frame analysis begins. I chose a nighttime city scene as I find it more moody and challenging to get right. Let’s dive right in: I’ll make you a rendering offer you can’t refuse.

Depth Prepass

As we know, a depth prepass is often a careful balance between the time you spend doing it and the time you save through more effective occlusion. Objects seem to be relatively well selected and sorted by depth and size: by drawcall 120 a lot of the biggest content is already in the depth buffer, rendered with very simple shaders. Many subsequent drawcalls then fail the depth test, avoiding wasted work. There are some odd choices like the electricity wires, which I assume have large bounding boxes, but most of it makes sense and probably costs little compared to what it saves.

GBuffer Pass

The GBuffer for Mafia packs quite a lot of information. The first texture contains normals and roughness, which is quite standard these days, in 16-bit floating point. While it’s a little large for my taste, normals tend to want as much bit-depth as possible, especially if no compression schemes are used.

R16F      G16F      B16F      A16F
Normal.x  Normal.y  Normal.z  Roughness

[Images: GBuffer Normals · GBuffer Roughness]

The second texture contains albedo and metalness in an 8-bit normalized format, which is also common for PBR engines and relevant here since cars sport very reflective chrome components. As you can see, metallic parts are marked as white whereas almost everything else is black (i.e. non-metal).

R8        G8        B8        A8
Albedo.r  Albedo.g  Albedo.b  Metalness

[Images: GBuffer Albedo · GBuffer Metalness]

The next texture contains packed quantities that aren't easy to decode by inspection. RenderDoc has a neat feature, custom shaders, that comes to our aid here. Searching the capture we come across the code for decoding these channels, and after adapting the D3D bytecode back to HLSL, displaying them on screen actually starts to make sense. The first three channels are motion vectors (including a z component, which I find interesting), and the last channel is the vertex normal encoded in two 8-bit values (z is implicit). It's interesting to note that vertex normals have only been given 2 bytes, as opposed to the 6 bytes assigned to per-pixel normals. Vertex normals are an unusual thing to output, but we'll soon find out why.

R16U            G16U            B16U            A16U
MotionVector.x  MotionVector.y  MotionVector.z  Encoded Vertex Normal
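
To make this concrete, here is a hedged sketch of how a vertex normal could be packed into that single 16-bit channel as two 8-bit values and reconstructed later. The actual encoding Mafia uses is not visible from the capture; this version assumes normals in a space where z is non-negative (e.g. view space) so its sign does not need to be stored.

// Hypothetical packing of a vertex normal into one 16-bit GBuffer channel
// as two 8-bit values, with z reconstructed on decode.
uint PackVertexNormal(float3 n)
{
    // Map x and y from [-1, 1] to [0, 255]
    uint x = (uint)round(saturate(n.x * 0.5 + 0.5) * 255.0);
    uint y = (uint)round(saturate(n.y * 0.5 + 0.5) * 255.0);
    return (y << 8) | x; // stored in the A16U channel
}

float3 UnpackVertexNormal(uint packed)
{
    float x = (packed & 0xFF) / 255.0 * 2.0 - 1.0;
    float y = ((packed >> 8) & 0xFF) / 255.0 * 2.0 - 1.0;
    float z = sqrt(saturate(1.0 - x * x - y * y)); // z is implicit
    return normalize(float3(x, y, z));
}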

[Images: GBuffer Encoded Motion Vectors · GBuffer Decoded Motion Vectors · GBuffer Encoded Vertex Normal · GBuffer Decoded Vertex Normal]

The fourth texture contains miscellaneous quantities such as specular intensity, curvature or thickness and profile for subsurface scattering, and flags. The G component is always set to 0.5, so it may be an unused/spare channel reserved for future use.

R8                  G8   B8                                A8
Specular Intensity  0.5  Curvature or Thickness (for SSS)  SSS Profile

[Images: GBuffer Specular Intensity · GBuffer Curvature · GBuffer SSS Profile]

The last entry in the GBuffer is the emissive lighting, which becomes the main lighting buffer from now on.

R11F        G11F        B10F
Emissive.r  Emissive.g  Emissive.b

One interesting performance decision for the GBuffer is not clearing it at the start of the frame. Sometimes clearing a buffer is necessary, but you can avoid the cost if you know the contents are going to be overwritten and you track where (for example by marking valid regions in the stencil). There can be other performance penalties involved in clearing depending on the platform, so the gist of it is that it's rarely a bad idea to avoid clearing if you can.

Downscaled Depth and Normals

Downscaling depth and normals is a common technique for running expensive effects at lower resolution. As we'll see later, this is put to good use during the global illumination pass. For now we'll mention that the game creates 3 mipmaps for depth and for normals (both vertex and per-pixel). It also creates a downscaled texture containing edge information. These textures are useful for preserving edges when upsampling from a low resolution, which, as we'll see, happens often. The edges are cleverly packed into a single R8_UNORM texture, which is very little memory for the information it stores. This or a similar scheme is used in Intel's Adaptive Screen Space Ambient Occlusion.
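
As a rough illustration of how four directional edge values can fit into a single R8_UNORM texel, here is a sketch in the spirit of Intel's ASSAO: each neighbour gets a 2-bit "edge strength" derived from depth differences. The edge metric, threshold and neighbourhood used here are assumptions for illustration, not the game's actual shader.

Texture2D<float> DepthTex;

float ComputePackedEdges(int2 pixel)
{
    float c = DepthTex[pixel];
    // Depth deltas against the four direct neighbours
    float4 delta = abs(float4(DepthTex[pixel + int2(-1, 0)],
                              DepthTex[pixel + int2( 1, 0)],
                              DepthTex[pixel + int2( 0, -1)],
                              DepthTex[pixel + int2( 0, 1)]) - c);
    // Quantize each delta to 2 bits (3 = flat, 0 = hard edge)
    uint4 edge2bit = (uint4)clamp((1.0 - delta / max(c * 0.04, 1e-5)) * 3.0 + 0.5, 0.0, 3.0);
    uint packed = edge2bit.x | (edge2bit.y << 2) | (edge2bit.z << 4) | (edge2bit.w << 6);
    return packed / 255.0; // stored as UNORM, decoded with the reverse shifts
}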

[Images: Mafia Depth Mips 2-3 · Vertex Normals Mips 2-3 · Edges Mips 3-4 · Pixel Normals Mips 2-3]

Occlusion Culling

A typical Mafia scene is a relatively dense urban environment, and many objects are occluded by other large objects. Occlusion culling is the family of techniques that avoids processing them. One technique that other engines like UE4 have used to great success is occlusion queries, a way of asking the GPU whether certain geometry is hidden behind the depth buffer, and even how occluded it is as a pixel count. A couple of notes on these techniques:

  1. Small Delay: Queries happen on the GPU and it takes 2-3 frames for that information to propagate back to the CPU, depending on the implementation. This delay can cause objects to pop on screen if the transition is abrupt
  2. GPU Solution: Sometimes these queries can be used with a feature called predicated rendering, which bypasses those issues but loses visibility on the CPU side
  3. Overhead: These tests need to be as fast as possible, but even rasterizing just a few hundred boxes isn’t free, so occlusion testing can happen on small conservative depth buffers to make it as cheap as possible

There are drawcalls that look like rooms and big chunks of geometry, which suggests that the engine may bucket objects into boxes. Another interesting fact is that queries are also done on the cascades of the shadow map: if objects are occluded from the light's point of view, we can avoid wasting performance.

 

Deferred Decals

Deferred decals are a common technique to add detail or modify material properties on surfaces. They are well suited to a deferred renderer because the pipeline already outputs the necessary quantities, and there are many articles and presentations covering them. Mafia decals are rasterized, instanced boxes. Certain objects lay down a stencil mask during the GBuffer pass so that decals don't render on top of them (e.g. the character or the cars).

[Images: Decals Before · Decals Wireframe · Decals Stencil Mask · Decals After]

Stencil Composite

A common issue with decals is blending. For one, it sometimes requires equations that the blend unit cannot express (such as normal blending), but also the alpha channel cannot simultaneously be a blend factor and a blendable quantity. To avoid these issues Mafia uses a clever trick: some decals render into intermediate decal buffers and sample from the original buffers output by the GBuffer pass. After decals have been rendered, a composite pass puts both buffers back together, using the decal buffer's alpha channel as the blend factor. During rendering, decals also write a stencil value so the composite can avoid a fullscreen copy and only combine the relevant parts back into the original GBuffer, which makes it a very scalable technique.
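
A hedged sketch of what the albedo part of such a composite could look like is below: the decal buffer's alpha acts purely as a blend factor against the original GBuffer contents, and the drawcall would be set up with a stencil test so that only decal-marked pixels execute. Resource names are mine, not the game's.

Texture2D GBufferAlbedo;   // original albedo from the GBuffer pass
Texture2D DecalAlbedo;     // intermediate decal albedo, blend factor in alpha

float4 CompositeDecalAlbedo(float4 pos : SV_Position) : SV_Target0
{
    int2 p = int2(pos.xy);
    float4 decal = DecalAlbedo[p];
    float4 original = GBufferAlbedo[p];

    // Alpha is free to be a pure blend factor here, something that is awkward
    // to express in the fixed-function blend unit when alpha itself must also blend.
    float3 albedo = lerp(original.rgb, decal.rgb, decal.a);
    return float4(albedo, original.a);
}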

[Images: GBuffer Albedo · Decal Buffer Albedo · Decal Buffer Alpha · Decal Stencil · Decal Albedo Final · GBuffer Normals · Decal Buffer Normals · Decal Normal Final]

Global Illumination

One thing the Mafia engine does really well is realtime global illumination. GI is the process of sampling the lighting environment around a given surface and integrating the result (i.e. adding all rays and applying certain weights to each). This is often impractical for realtime, so most solutions distribute rays spatially and/or temporally, and reuse those results by blurring and/or accumulating over time.

Stochastic Sampling

This works by randomly distributing vectors around the main normal in a manner consistent with the surface properties and type of illumination. For example, sampling vectors for diffuse illumination look like a noisy version of the original vertex normals, as they are centered around the normal and distributed uniformly around the hemisphere.
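
A minimal sketch of generating such a direction from two random numbers follows. The exact distribution and random sequence Mafia uses are unknown; this is a standard cosine-weighted hemisphere sample around the normal, a common choice for diffuse GI.

float3 SampleHemisphere(float3 n, float2 xi)
{
    // Cosine-weighted direction in tangent space
    float phi = 2.0 * 3.14159265 * xi.x;
    float cosTheta = sqrt(1.0 - xi.y);
    float sinTheta = sqrt(xi.y);
    float3 local = float3(cos(phi) * sinTheta, sin(phi) * sinTheta, cosTheta);

    // Build an orthonormal basis around the normal and transform into it
    float3 up = abs(n.z) < 0.999 ? float3(0, 0, 1) : float3(1, 0, 0);
    float3 tangent = normalize(cross(up, n));
    float3 bitangent = cross(n, tangent);
    return tangent * local.x + bitangent * local.y + n * local.z;
}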

[Images: Vertex Normals · Stochastic Vertex Normals]

For specular, they are centered around the reflection vector produced by the view vector and the per-pixel normal. Specular samples are more concentrated around the reflection vector for smoother surfaces, according to the BRDF. In the extreme case (zero roughness, such as on the car) the random vector is exactly the reflection vector.
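
For reference, the textbook way to produce such a sample is to importance-sample a GGX half-vector around the per-pixel normal and reflect the view vector about it; at zero roughness this collapses to the mirror reflection, matching the behaviour described above. This is a generic sketch, not the game's actual code.

float3 SampleSpecularDirection(float3 n, float3 v, float roughness, float2 xi)
{
    float a = roughness * roughness;

    // GGX importance-sampled half-vector in tangent space
    float phi = 2.0 * 3.14159265 * xi.x;
    float cosTheta = sqrt((1.0 - xi.y) / (1.0 + (a * a - 1.0) * xi.y));
    float sinTheta = sqrt(1.0 - cosTheta * cosTheta);
    float3 h = float3(cos(phi) * sinTheta, sin(phi) * sinTheta, cosTheta);

    // Rotate the half-vector into world space around the normal
    float3 up = abs(n.z) < 0.999 ? float3(0, 0, 1) : float3(1, 0, 0);
    float3 tangent = normalize(cross(up, n));
    float3 bitangent = cross(n, tangent);
    h = tangent * h.x + bitangent * h.y + n * h.z;

    return reflect(-v, h); // v points from the surface towards the camera
}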

[Images: Pixel Normals · Stochastic Reflection Vectors]

GI in Mafia is a combination of screen space and “statically” computed lighting. Every ray cast will try to find a source on screen first, and fall back to a volumetric structure. The source for on screen lighting is the previous frame, using the motion vectors for reprojection. For diffuse, the volumetric fallback is a volume structure containing low frequency lighting around the camera. For specular, the fallback is a cascade of cubemaps that has both depth and lighting information and gets raymarched in the same manner as the screen depth buffer.

The process works at 3 resolutions and cleverly upscales in successive iterations to mitigate the cost, and the last step includes temporal accumulation. Both diffuse and specular conceptually work in the same way. A sequence of images is worth more than a lengthy explanation.

[Images: Diffuse GI Mips 1-3 · Mip 3 Blur H / Blur · Mip 2 Upscale Blur · Mip 1 Upscale Blur · Diffuse GI Final]

[Images: Reflections Mips 1-3 · Mip 3 Blur H / Blur · Mip 2 Upscale Blur · Mip 1 Upscale Blur · Reflections Final]

GI Update

As we have mentioned already, GI works partly in screen space and partly uses fallback structures. A process at the beginning of the frame incrementally updates them. The first step that happens across many frames is cubemap capturing. Cubemaps are captured around the player as they traverse the level, containing both depth and lighting, and there are other textures that provide extra information.

The cubemaps are also preprocessed to produce volume textures that represent what looks like outgoing radiance extracted from those cubemaps and a main direction vector. Other textures look like they might encode some form of light leaking prevention. In any case, it is this volume structure that diffuse rays fall back to when they miss the screen. In the case of specular reflections, the ray is traced directly through the cubemap depth buffer until an intersection is found. For more details on this process, Hangar 13’s Martin Sobek published a detailed GDC presentation.

Ambient Occlusion

SSAO

Screen Space Ambient Occlusion is a standard technique so we won't go into much detail about it. It seems to be used for relatively short range occlusion in general, with a radius that is constant in screen space (this helps capture detail in the distance, even if the occlusion looks "larger" there).

Car Occlusion

It is hard to get ambient occlusion under objects with a screen space technique, so Mafia resorts to the oldest trick in the book: darkening the ambient lighting using a texture, in a manner not too dissimilar to decals. It is car-specific and not used for anything else. I have overlaid the wireframe in the shadow shot so you can see clearly how the shadow relates to the car.

Direct Illumination

Unlike other games using tiled or clustered lighting, Mafia instead uses classic deferred techniques with some tricks worth mentioning. In both day and night a standard directional light is present.

Screen Space Contact Shadows

A known issue with standard shadow mapping is the difficulty of getting shadows that join perfectly at the contact point between two surfaces. Typical artifacts in this situation are:

  1. Peter-Panning: Sometimes developers who add a small bias to avoid shadow self-intersection artifacts will cause another undesired effect where shadows look detached from an object and the object looks like it’s floating
  2. No Contact: If the engine has soft shadows, the radius is often applied with no regard to the distance between the occluder and the receiver. Therefore even at the surface boundary the shadow will look soft and not grounded
  3. Shadow Resolution: If the target performance isn't reached, developers often compromise on shadow map resolution, which of course impacts how crisp the shadow result can be

For these reasons contact shadow techniques were developed. It is yet another screen space raymarching solution: a ray is cast from the depth buffer in the direction of the light until a suitable intersection is reached.
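
A simple sketch of the idea: march a short segment from the surface towards the light in screen space and compare each step against the depth buffer. Step count, bias, matrix names and the depth comparison (written here for reverse-Z) are illustrative; a real implementation also handles occluder thickness and fades the result.

Texture2D<float> SceneDepth;
SamplerState PointClamp;
float4x4 ViewProjection;

float ContactShadow(float3 worldPos, float3 lightDir, float rayLength)
{
    const int kSteps = 16;
    for (int i = 1; i <= kSteps; i++)
    {
        float3 samplePos = worldPos + lightDir * rayLength * (i / (float)kSteps);

        // Project the sample into screen space
        float4 clip = mul(float4(samplePos, 1.0), ViewProjection);
        clip.xyz /= clip.w;
        float2 uv = clip.xy * float2(0.5, -0.5) + 0.5;

        // If the stored depth is closer to the camera than the ray sample,
        // something is blocking the path to the light
        float sceneDepth = SceneDepth.SampleLevel(PointClamp, uv, 0);
        if (sceneDepth > clip.z + 0.0005)
            return 0.0; // in shadow
    }
    return 1.0; // unoccluded along the marched segment
}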

Parallel Split Shadows

Parallel split (cascaded) shadow mapping is also fairly standard. Mafia renders several cascades into a 2048×2048 texture array. The cascades are resolved incrementally onto the main shadow mask buffer, using stencil and depth trickery to quickly discard pixels outside each cascade's range. The closest cascade is sampled with a lot of detail whereas the farther cascades are sampled with less. The result is combined with the contact shadows.

[Images: Contact Shadows · Contact Shadows + Cascade 2 · Shadows Cascade 0 · Shadows Complete]

The typical arsenal of point and spot lights is also available. They are rendered in two ways depending on their screen size:

  • For lights that take up a lot of screen real estate, the venerable stencil mask technique is used. This technique sets up a stencil mask of the pixels a light touches and then renders the light, writing only to that region and taking advantage of early stencil rejection. An extra optimization is that the mask is prepared for several lights up front, and then all of those lights are rendered
  • For smaller lights it wasn't worth the overhead of creating the stencil mask, so a quad covering the screen area of the light is used instead. I'm not too sold on this, as it feels like many pixels have zero contribution, adding cost to the frame where a tighter shape could have worked better
[Images: Bounce Light Only · Dynamic Lights 1-6 · Dynamic Lights Directional 1-2 · Stencil Masks 1-3 · Stencil Mask Results 1-2]

Character Shadows

Characters have their own shadow maps. A custom shader that renders the character geometry and samples only the character shadow map is composited on top of the shadow mask, blended with a min operator to add finer detail. Notice the crisp shadows under the hat and the jacket lapels.

Subsurface Scattering

After lighting, subsurface scattering kicks in. It's a subtle effect so we'll zoom in. The implementation is most likely screen-space subsurface scattering (SSSS) by Jorge Jiménez, which has become pretty standard. It is essentially a bilateral screen-space Gaussian blur with carefully tuned weights derived from skin profiles, whose width can vary with the thickness/curvature values from the GBuffer. The blur only happens on the diffuse component of the lighting, so diffuse and specular are separated, then composited back together.
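
Below is a very reduced sketch of one direction of such a separable blur, scaling the kernel width by the curvature/thickness value from the GBuffer and operating only on the diffuse lighting. The real technique uses carefully derived per-profile weights; the weights, channel and scale here are placeholders.

Texture2D DiffuseLighting;
Texture2D GBufferMisc;      // curvature/thickness assumed in .b
SamplerState LinearClamp;

float4 SSSBlurHorizontal(float2 uv : TEXCOORD0) : SV_Target0
{
    static const float weights[5] = { 0.227, 0.195, 0.122, 0.054, 0.016 };
    float width = GBufferMisc.SampleLevel(LinearClamp, uv, 0).b * 0.01; // placeholder scale

    float3 result = DiffuseLighting.SampleLevel(LinearClamp, uv, 0).rgb * weights[0];
    for (int i = 1; i < 5; i++)
    {
        float2 offset = float2(width * i, 0.0);
        result += DiffuseLighting.SampleLevel(LinearClamp, uv + offset, 0).rgb * weights[i];
        result += DiffuseLighting.SampleLevel(LinearClamp, uv - offset, 0).rgb * weights[i];
    }
    return float4(result, 1.0);
}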

[Images: Diffuse Lighting · SSS Stencil]

For optimization, a stencil mask is produced so the shader only blurs where strictly necessary and the cost scales with the affected area. SSS can be more expensive in a cutscene, but during gameplay the affected screen area is tiny and the effect likely has a negligible impact.

Clouds

Mafia sports an unusual solution for clouds. It starts off by rendering a big dome-shaped object with depth testing turned on, creating a stencil mask so that a later shader only writes to the visible sky pixels. That shader samples three textures: a couple of volume textures with 8 slices unwrapped into 2D textures, and an actual 3D noise texture. The choice of 2D textures emulating a 3D texture is a recurring pattern, as we'll see later. One of the 8 slices is regenerated every frame to advance the cloud simulation, i.e. a full cloud cycle completes every 8 frames.
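
Sampling one of these unwrapped volume textures amounts to picking two neighbouring slices in the 2D layout and filtering between them manually; a sketch is below. The horizontal strip layout is an assumption on my part.

Texture2D UnwrappedVolume;
SamplerState LinearClamp;

float4 SampleUnwrappedVolume(float3 uvw, float sliceCount)
{
    float slice = uvw.z * (sliceCount - 1.0);
    float slice0 = floor(slice);
    float slice1 = min(slice0 + 1.0, sliceCount - 1.0);

    // Each slice occupies a 1/sliceCount-wide strip of the 2D texture
    float stripWidth = 1.0 / sliceCount;
    float2 uv0 = float2((uvw.x + slice0) * stripWidth, uvw.y);
    float2 uv1 = float2((uvw.x + slice1) * stripWidth, uvw.y);

    // Hardware bilinear filtering within a slice, manual lerp across slices
    float4 a = UnwrappedVolume.SampleLevel(LinearClamp, uv0, 0);
    float4 b = UnwrappedVolume.SampleLevel(LinearClamp, uv1, 0);
    return lerp(a, b, slice - slice0);
}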

[Images: Cloud Density · Cloud Height]

The three textures are combined to create a fullscreen cloud texture containing a cloud mask for the presence of clouds, plus single scattering and multiscattering in the other two channels. This texture is plugged in later when compositing with the entire sky.

[Images: Cloud Mask Wireframe · Cloud Mask · Cloud Single Scattering · Cloud Multi Scattering]

Atmospheric Sky + Stars

Mafia has an atmospheric simulation going on, and can set the environment to different weather and time conditions. One of the first steps uses a series of precomputed inscattering and outscattering textures to produce the Rayleigh and Mie scattering for the sky, at low resolution.

Rayleigh

Mie

The sky generation step takes the previous cloud data and blends it with an upsampled version of the Rayleigh and Mie textures depending on the time of day and weather conditions, correctly occluding the clouds with fog. This step is also accelerated by the stencil buffer, avoiding computations in the foreground. At nighttime there's an extra step: to render the starfield, stars are drawn as little quads on screen with a small glowing texture applied to them. They are also correctly occluded by the clouds and sky.

[Images: Mafia Sky Mask · Mafia Sky · Mafia Starfield Stencil · Mafia Sky Final]

Volumetric Fog

For volumetric fog, Mafia uses the unwrapped volume texture technique again, although this time there’s an interesting trick taking advantage of it. The idea is to “rasterize” a volumetric shape by slicing it into quads. Each quad renders to a slice in the volume, so e.g. you can split a spotlight into multiple slices and render them in one instanced drawcall. The outputs are both the radiance of the light at a given position and what looks like the position itself from the light’s point of view.

[Images: Spotlight 1 · Spotlight 2]

As we've seen already, Mafia loves its stencil buffer, so here's another interesting trick that I think sells this 2D emulation of a 3D texture. In a volume texture whose slices are distributed along depth from the screen, parts of slices are going to be hidden by geometry in front of them (like the car in this image). By marking occluded pixels we can avoid wasted computation.

This texture is noisy so a post-blur pass is performed on it. This blur helps hide the noise, but also helps with temporal stability, as this volume texture is low resolution compared to the screen (256x192x84). The blur also uses the stencil trickery mentioned above.

The volume texture is then combined with an actual atmospheric fog simulation (bluish color in the image above) and the result overlaid on top of the lit buffer.

[Images: Before Fog · After Fog]

Transparency and Glows

Transparent objects that need reflections are rendered twice. First they’re output onto an offscreen buffer at half resolution, rendering the closest reflection vectors and depth, which are then used to trace reflections. The resulting reflection texture is later read by the actual full resolution transparency pass and composited back. Smoke and glows are also included in this pass.

[Images: Transparency Reflection Vectors · Transparency Reflections · Transparency · Transparency Composite · City Glows · Car Glows · Transparency + Glows]

Temporal AA

Mafia's antialiasing solution is temporal AA, which has become relatively standard these days. It has the typical ingredients, such as an accumulation buffer and motion vectors to access the previous frame's contents, as described here. It creates a disocclusion texture to mitigate trailing, and also attempts to remove very bright pixel outliers using a combination of a downscaled HDR texture and the luminance of the scene.
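
The core of such a resolve usually looks something like the sketch below: reproject into the history buffer with the motion vectors, clamp the history to the current pixel's neighbourhood to reject stale data, and blend. Names and the blend factor are illustrative rather than taken from the game.

Texture2D CurrentColor;
Texture2D HistoryColor;
Texture2D<float2> MotionVectors;
SamplerState LinearClamp;
float2 TexelSize; // 1 / resolution

float4 ResolveTAA(float2 uv : TEXCOORD0) : SV_Target0
{
    float3 current = CurrentColor.SampleLevel(LinearClamp, uv, 0).rgb;

    // Min/max of the 3x3 neighbourhood, used to constrain the history sample
    float3 nMin = current;
    float3 nMax = current;
    for (int y = -1; y <= 1; y++)
    {
        for (int x = -1; x <= 1; x++)
        {
            float3 c = CurrentColor.SampleLevel(LinearClamp, uv + float2(x, y) * TexelSize, 0).rgb;
            nMin = min(nMin, c);
            nMax = max(nMax, c);
        }
    }

    // Reproject into the accumulation buffer and clamp to reject disoccluded history
    float2 velocity = MotionVectors.SampleLevel(LinearClamp, uv, 0);
    float3 history = HistoryColor.SampleLevel(LinearClamp, uv - velocity, 0).rgb;
    history = clamp(history, nMin, nMax);

    return float4(lerp(history, current, 0.1), 1.0); // keep ~10% of the new frame
}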

[Images: TAA Input · TAA Output]

Post-Processing

There is a single big shader that composites camera effects such as tonemapping, exposure correction, bloom, film grain, dirt, etc. We’ll go over some but they are fairly standard.

Bloom happens by blurring a thresholded HDR buffer. Instead of concatenating successive blurs and upscaling as other implementations do, each blur is done independently on a downscaled buffer (mips 1-4) and the results are then combined.
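
As a rough sketch, the two ends of such a bloom chain could look like this: a threshold pass that keeps only bright HDR pixels, and a combine pass that adds the independently blurred mips back together. The threshold and weights are placeholders, not the game's values.

Texture2D HDRScene;
Texture2D BloomMip1, BloomMip2, BloomMip3, BloomMip4; // independently blurred downscaled buffers
SamplerState LinearClamp;

float4 BloomThreshold(float2 uv : TEXCOORD0) : SV_Target0
{
    float3 color = HDRScene.SampleLevel(LinearClamp, uv, 0).rgb;
    float luminance = dot(color, float3(0.2126, 0.7152, 0.0722));
    return float4(color * saturate(luminance - 1.0), 1.0); // keep only bright pixels
}

float4 BloomCombine(float2 uv : TEXCOORD0) : SV_Target0
{
    float3 bloom = BloomMip1.SampleLevel(LinearClamp, uv, 0).rgb
                 + BloomMip2.SampleLevel(LinearClamp, uv, 0).rgb
                 + BloomMip3.SampleLevel(LinearClamp, uv, 0).rgb
                 + BloomMip4.SampleLevel(LinearClamp, uv, 0).rgb;
    return float4(bloom * 0.25, 1.0); // weight of each mip is an assumption
}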

[Images: Bloom Mips 1-4, each before blur, after horizontal blur (H) and after full blur (HV) · Bloom Full]

Film grain is not unusual for games that want a Hollywood look or to imitate older film, and Mafia is a good candidate given its setting. A simple noise texture is applied on top of the entire image and shifted over time, mimicking sensor noise on a dark night. Tonemapping is done using a color cube, as we saw in Shadow of Mordor, and the vignette has an interesting little quirk. Instead of adding a fullscreen pass, it renders a squashed octagon in the middle of the screen with the bounds of the vignette, and from there the pixel shader derives the intensity of the effect. I think it's mainly used when you're injured, and is meant to go dark red to represent blood. Screen dirt is added on top as well, and this technically finishes the frame.

[Images: Before Post · After Post · Vignette Wireframe · Screen Dirt]

UI

The UI is rendered directly on top of the swapchain at the end of the frame. It's all pretty standard here except for the rendering of the realtime minimap. As has become tradition by now, stencil is used to mark the region of interest, and then flat geometry representing streets, buildings, routes, etc. is rendered on top. After that, a series of antialiased borders is rendered to soften the edges, and small icons such as the car are overlaid on top.

[Images: UI Map Rendering · UI Map Rendering 2 · UI Final]

Closing Remarks

With this our analysis ends; hopefully you've enjoyed it. Not only is Mafia a great game, it also looks really good, and we now have a small insight into how that was done. As always, if this leaves you with an appetite for more analyses, Adrian Courrèges kindly keeps a repository of many other game studies.

A Macro View of Nanite

After showing an impressive demo last year and recently being unleashed with the UE5 preview, Nanite is all the rage these days. I just had to go in and have some fun trying to figure it out, explaining how I think it operates and the technical decisions behind it using a RenderDoc capture. Props to Epic for being open with their tech, which makes it easier to learn and pick apart; the editor has markers and debug information that are going to be super helpful.

This is the frame we’re going to be looking at, from the epic showdown in the Valley of the Ancient demo project. It shows the interaction between Nanite and non-Nanite geometry and it’s just plain badass.

Nanite::CullRasterize

The first stage in this process is Nanite::CullRasterize, and it looks like this. In a nutshell, this entire pass is responsible for culling instances and triangles and rasterizing them. We’ll refer to it as we go through the capture.

Instance Culling

Instance culling is one of the first things that happens here. It looks to be a GPU form of frustum and occlusion culling. There is instance data and primitive data bound here, so I guess it culls at the instance level first, and if the instance survives it starts culling at a finer-grained level. The Nanite.Views buffer provides camera info for frustum culling, and a hierarchical depth buffer (HZB) is used for occlusion culling. The HZB is sourced from the previous frame and forward projected to this one. I'm not sure how it deals with dynamic objects; it may be that it uses such a high mip (small resolution) that it is conservative enough. EDIT: According to the Nanite paper, the HZB is generated this frame with the previous frame's visible objects. The HZB is then tested with those objects as well as anything new, and visibility is updated for the next frame.
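
The general shape of an HZB occlusion test is sketched below: project the instance bounds to screen, pick a mip where the bounds cover only a couple of texels, and compare the instance's nearest depth against the farthest occluder depth stored there. This mirrors the general idea only, not Nanite's actual implementation, and assumes reverse-Z as UE uses.

Texture2D<float> HZB;
SamplerState PointClamp;

bool IsVisibleHZB(float2 uvMin, float2 uvMax, float nearestClipDepth, float2 hzbSize)
{
    // Choose a mip so the rectangle spans roughly one texel in each dimension
    float2 extentPixels = (uvMax - uvMin) * hzbSize;
    float mip = ceil(log2(max(max(extentPixels.x, extentPixels.y), 1.0)));

    // Sample the four corners of the rectangle at that mip
    float d0 = HZB.SampleLevel(PointClamp, uvMin, mip);
    float d1 = HZB.SampleLevel(PointClamp, float2(uvMax.x, uvMin.y), mip);
    float d2 = HZB.SampleLevel(PointClamp, float2(uvMin.x, uvMax.y), mip);
    float d3 = HZB.SampleLevel(PointClamp, uvMax, mip);

    // With reverse-Z the HZB stores the minimum (farthest) depth per texel;
    // the instance is visible if its nearest depth is in front of that
    float farthestOccluder = min(min(d0, d1), min(d2, d3));
    return nearestClipDepth >= farthestOccluder;
}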

Both visible and non-visible instances are written into buffers. For the latter I’m thinking this is the way of doing what occlusion queries used to do in the standard mesh pipeline: inform the CPU that a certain entity is occluded and it should stop processing until it becomes visible. The visible instances are also written out into a list of candidates.

Persistent Culling

Persistent culling seems to be related to streaming. It uses a fixed number of compute threads, suggesting it is unrelated to the complexity of the scene and instead perhaps walks some spatial structure for occlusion. This is one complicated shader, but based on the inputs and outputs we can see it writes out how many triangle clusters of each type (compute and traditional raster) are visible into a buffer called MainRasterizeArgsSWHW (SW: compute/software raster, HW: hardware raster).

Clustering and LODding

It’s worth mentioning LODs at this point as it is probably around here where those decisions are made. Some people speculated geometry images as a way to do continuous LODding but I see no indication of this. Triangles are grouped into patches called clusters, and some amount of culling is done at the cluster level. The clustering technique has been described before in papers by Ubisoft and Frostbite. For LODs, clusters start appearing and disappearing as the level of detail descends within instances. Some very clever magical incantations are employed here that ensure all the combinations of clusters stitch into each other seamlessly.

Continue reading

The Rendering of Jurassic World: Evolution

Jurassic World: Evolution is the kind of game many kids (and adult-kids) dreamed of for a long time. What’s not to like about a game that gives you the reins of a park where the main attractions are 65-million-year-old colossal beasts? This isn’t the first successful amusement park game by Frontier Developments, but it’s certainly not your typical one. Frontier is a proud developer of their Cobra technology, which has been evolving since 1988. For JWE in particular it is a DX11 tiled forward renderer. For the analysis I used Renderdoc and turned on all the graphics bells and whistles. Welcome… to Jurassic Park.

The Frame

It's hard to decide what to present as a frame for this game, because free navigation and a dynamic time of day mean you have limitless possibilities, from a bird's eye view to an extreme closeup of the dinosaurs, a sunset, a bright day or a hurricane. I chose a moody, rainy intermediate view that captures the dark essence of the original movies, taking advantage of the Capture Mode introduced in version 1.7.

Compute Shaders

The first thing to notice about the frame is that it is very compute-heavy. In the absence of markers, Renderdoc splits rendering into passes if there is more than one Draw or Dispatch command targeting the same output buffers. According to the capture there are 15 compute vs 18 color/depth passes, i.e. it is broadly split into half compute, half draw techniques. Compute can be more flexible than draw (and, if done correctly, faster), but a lot of time has to be spent fine-tuning and balancing performance. Frontier clearly spared no expense developing the technology to get there; however, this also means that analyzing a frame is a bit harder.

Grass Displacement

A big component of JWE is foliage and its interaction with cars, dinosaurs, wind, etc. To animate the grass, one of the very first processes populates a top-down texture that contains grass displacement information. This grass displacement texture is later read in the vertex shader of all the grass in the game, and the information is used to modify the positions of the vertices of each blade of grass. The texture wraps around as the camera moves and fills in the new regions that appear at the edges. This means that the texture doesn't necessarily look like a top-down snapshot of the scene, but will typically be split into 4 quadrants. The process involves these steps:

  1. Render dinosaurs and cars, probably other objects such as the gyrospheres. This doesn’t need an accurate version of the geometry, e.g. cars only render wheels and part of the chassis, which is in contact with grass. The result is a top down depth buffer (leftmost image). If you squint you’ll see the profile of an ankylosaurus. The other dinosaurs aren’t rendered here, perhaps the engine knows they aren’t stepping on grass and optimizes them out.
  2. Take this depth buffer and a heightmap of the scene (center image), and output three quantities: a mask telling whether the depth of the object was above/below the terrain, the difference in depth between them, and the actual depth, packing them into a 3-channel texture (rightmost image)
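
An illustrative sketch of how a grass vertex shader might consume that top-down texture is below; all names, channel meanings and scales are assumptions on my part, not Frontier's.

Texture2D GrassDisplacement;
SamplerState LinearWrap;

cbuffer GrassConstants
{
    float2 WindowOrigin;   // world-space origin of the top-down window
    float  WindowSize;     // world-space extent covered by the texture
};

float3 ApplyGrassDisplacement(float3 worldPos, float bladeHeight01)
{
    // The texture wraps around as the camera moves, hence the wrap sampler
    float2 uv = (worldPos.xz - WindowOrigin) / WindowSize;
    float3 disp = GrassDisplacement.SampleLevel(LinearWrap, uv, 0).xyz;

    // Bend the blade more towards the tip, leaving the root anchored
    worldPos.xz += disp.xy * bladeHeight01;
    worldPos.y  -= disp.z * bladeHeight01; // flatten under heavy objects (assumption)
    return worldPos;
}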



An additional process simulates wind. In this particular scene there is a general breeze from the storm plus a helicopter, both producing currents that displace grass. This is a top down texture similar to the one before containing motion vectors in 2D. The motion for the wind is an undulating texture meant to mimic wind waves which seems to have been computed on the CPU, and the influence of the helicopter is cleverly done blending a stream of particles on top of the first texture. You can see it in the image as streams pulling outward. Dinosaur and car motion is also blended here. I’m not entirely sure what the purpose of the repeating texture is (you can see the same objects repeated multiple times).

Continue reading

Rendering Line Lights

Within the arsenal of lights provided by game engines, the most popular are punctual lights such as point, spot or directional because they are cheap. On the other end, area lights have recently produced incredible techniques such as Linearly Transformed Cosines and other analytic approximations. I want to talk about the line light.

Update [04/09/2020] When I originally wrote the article there were no public images showing Jedi or lightsabers so I couldn’t make the connection (though a clever reader could have concluded what they might be for!) I can finally show this work off as it’s meant to be.

In Unreal Engine 4, modifying ‘Source Length’ on a point light elongates it as described in this paper. It spreads the intensity along the length so a longer light becomes perceptually dimmer. Frostbite also has tube lights, a complex implementation of the analytical illuminance emitted by a cylinder and two spheres. Unity includes tube lights as well in their HD Render Pipeline (thanks Eric Heitz and Evegenii Golubev for pointing it out) based on their LTC theory, which you can find a great explanation and demos for here. Guerrilla Games’ Decima Engine has elongated quad lights using an approach for which they have a very attractive and thorough explanation in GPU Pro 5’s chapter II.1, Physically Based Area Lights. This is what I adapted to line lights.

Continue reading

The Rendering of Rise of the Tomb Raider

Rise of the Tomb Raider (2015) is the sequel to the excellent Tomb Raider (2013) reboot. I personally find both refreshing as they move away from the stagnating original series and retell the Croft story. The game is story focused and, like its prequel, offers enjoyable crafting, hunting and climbing/exploring mechanics.

Tomb Raider used the Crystal Engine, developed by Crystal Dynamics and also used in Deus Ex: Human Revolution. For the sequel a new engine called Foundation was used, previously developed for Lara Croft and the Temple of Osiris (2014). Its rendering can be broadly classified as a tiled light-prepass engine, and we'll see what that means as we dive in. The engine offers the choice between a DX11 and a DX12 renderer; I chose the latter for reasons we'll see later. I used Renderdoc 1.2 to capture the frame on a Geforce 980 Ti, and turned on all the bells and whistles.

The Frame

I can safely say without spoilers that in this frame bad guys chase Lara because she’s looking for an artifact they’re looking for too, a conflict of interest that absolutely must be resolved using weapons. Lara is inside the enemy base at nighttime. I chose a frame with atmospheric and contrasty lighting where the engine can show off.

Depth Prepass

A customary optimization in many games, a small depth prepass takes place here (~100 draw calls). The game renders the biggest objects (rather the ones that take up the most screen space), to take advantage of the Early-Z capability of GPUs. A concise article by Intel explains further. In short, the GPU can avoid running a pixel shader if it can determine it’s occluded behind a previous pixel. It’s a relatively cheap pass that will pre-populate the Z-buffer with depth.

An interesting thing I found is a level of detail (LOD) technique called 'fizzle' or 'checkerboard'. It's a common way to fade objects in and out at a distance, either to later replace them with a lower quality mesh or to make them disappear completely. Take a look at this truck. It seems to be rendering twice, but in reality it's rendering a high LOD and a low LOD at the same position, each rendering to the pixels the other is not rendering to. The first LOD is 182226 vertices, whereas the second LOD is 47250. They're visually indistinguishable at a distance, and yet one is almost 4 times cheaper. In this frame, LOD 0 has almost disappeared while LOD 1 is almost fully rendered. Once LOD 0 completely disappears, only LOD 1 will render.

A pseudorandom texture and a probability factor allow us to discard pixels that don’t pass a threshold. You can see this texture used in ROTR. You might be asking yourself why not use alpha blending. There are many disadvantages to alpha blending over fizzle fading.

  1. Depth prepass-friendly: By rendering it like an opaque object and puncturing holes, we can still render into the prepass and take advantage of early-z. Alpha blended objects don’t render into the depth buffer this early due to sorting issues.
  2. Needs extra shader(s): If you have a deferred renderer, your opaque shader doesn’t do any lighting. You need a separate variant that does if you’re going to swap an opaque object for a transparent one. Aside from the memory/complexity cost of having at least an extra shader for all opaque objects, they need to be accurate to avoid popping. There are many reasons why this is hard, but it boils down to the fact they’re now rendering through a different code path.
  3. More overdraw: Alpha blending can produce more overdraw and depending on the complexity of your objects you might find yourself paying a large bandwidth cost for LOD fading.
  4. Z-fighting: z-fighting is the flickering effect when two polygons render to a very similar depth such that floating point imprecision causes them to “take turns” to render. If we render two consecutive LODs by fading one out and the next one in, they might z-fight since they’re so close together. There are ways around it like biasing one over the other but it gets tricky.
  5. Z-buffer effects: Many effects like SSAO rely on the depth buffer. If we render transparent objects at the end of the pipeline when ambient occlusion has run already, we won’t be able to factor them in.

One disadvantage of this technique is that it can look worse than alpha fading, but a good noise pattern, post-fizzle blurring or temporal AA can hide it to a large extent. ROTR doesn’t do anything fancy in this respect.
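
For illustration, the core of a fizzle fade is tiny; something along these lines, where the texture name and tiling are assumptions:

Texture2D<float> DitherNoise;   // hypothetical pseudorandom texture
SamplerState PointWrap;

void FizzleFade(float4 screenPos, float fadeAmount) // 0 = invisible, 1 = fully visible
{
    // Tile the noise texture across the screen and discard pixels that fail
    // the probability test, keeping the object fully opaque where it survives
    float noise = DitherNoise.Sample(PointWrap, screenPos.xy / 64.0);
    clip(fadeAmount - noise);
}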

Normals Pass

Crystal Dynamics uses a relatively unusual lighting scheme in its games, which we'll describe in the lighting pass. For now, suffice it to say that there is no G-Buffer pass, at least not in the sense that other games have us accustomed to. Instead, the objects in this pass only output depth and normal information. Normals are written to an RGBA16_SNORM render target in world space. As a curiosity, this engine uses Z-up, as opposed to the Y-up convention I see more often in other engines and modelling packages. The alpha channel contains glossiness, which will be decompressed later as exp2(glossiness * 12 + 1.0). The glossiness value can actually be negative, as the sign is used as a flag to indicate whether a surface is metallic. You can almost spot it yourself, as the darker values in the alpha channel all correspond to metallic objects.

R         G         B         A
Normal.x  Normal.y  Normal.z  Glossiness + Metalness
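
A small sketch of decoding that alpha channel as described above; using abs() to strip the sign before decompression is my assumption:

void DecodeGlossinessMetalness(float encoded, out float specularPower, out bool isMetal)
{
    isMetal = encoded < 0.0;                          // the sign doubles as a metal flag
    specularPower = exp2(abs(encoded) * 12.0 + 1.0);  // decompression described above
}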

[Images: Normals · Glossiness/Metalness]

Continue reading

A real life pinhole camera

When I got married last year, my wife and I went on our honeymoon to Thailand. Their king Bhumibol had died just a month before and the whole country was mourning, so everywhere we found memorials and good wishes for their king, and people would dress in black and white as a sign of sorrow. The Thai are a gentle and polite people who like to help out; we'd ask for directions and people with no notion of English would spend twenty minutes trying to understand and answer our questions. Thailand has a rich history of rising and falling kingdoms, great kings and battles, and unification and invasions by foreign kingdoms. There are some amazing ruins of these kingdoms. Thailand also lives by a variant of Buddhism reflected in all of their beautiful temples. Some of the architectural features I found most interesting are the small reflective tiles that cover the outer walls, animal motifs like the Garuda (bird creatures that can be seen on the rooftops), and snake-like creatures called Naga. It is in this unexpected context that I found a real-life pinhole camera. I always wear my graphics hat, so I decided to capture it and later make a post.

First, a little background. A pinhole camera (also known as camera obscura, after its Latin name) is essentially the simplest camera you can come up with. If you imagine a closed box with a single, minuscule hole in one of its faces, such that only a single ray from each direction can come inside, you'd get a mirrored, inverted image of the outside on the inner face opposite the pinhole. An image is worth more than a thousand explanations, so here's what I'm talking about.

 

[Images: Pinhole Diagram]

As you can see, the concept is simple. If you were inside the room, you'd see an inverted image of the outside. The hole is so small that the room would be fairly dark, so even the faint light bouncing back towards you would still be visible. I made the pinhole a hexagon, as I wanted to suggest that it is effectively the shutter of a modern camera. Louis Daguerre, one of the fathers of photography, used this model in his famous daguerreotype circa 1835, but Leonardo da Vinci had already described the phenomenon as an oculus artificialis (artificial eye) in one of his works as early as 1502. There are plenty of additional resources if you're interested, and even a pretty cool tutorial on how to create your own.

Now that we understand what this camera is, let’s look at the real image I encountered. I’ve aligned the inside and outside images I took and cast rays so you can see what I mean.

 

[Images: Real Pinhole Camera]

The image of the inside looks brighter than it really was: I had to take it with 1 second of exposure and it still looks relatively dark. On top of that, the day outside was very sunny, which helped a lot in getting a clear "photograph".

The Rendering of Middle Earth: Shadow of Mordor

Middle Earth: Shadow of Mordor was released in 2014. The game itself was a great surprise, and the fact that it was a spin-off within the storyline of the Lord of the Rings universe was quite unusual and it’s something I enjoyed. The game was a great success, and at the time of writing, Monolith has already released the sequel, Shadow of War. The game’s graphics are beautiful, especially considering it was a cross-generation game and was also released on Xbox 360 and PS3. The PC version is quite polished and features a few extra graphical options and hi-resolution texture packs that make it shine.

The game uses a relatively modern deferred DX11 renderer. I used Renderdoc to delve into the game’s rendering techniques. I used the highest possible graphical settings (ultra) and enabled all the bells and whistles like order-independent transparency, tessellation, screen-space occlusion and the different motion blurs.

The Frame

This is the frame we’ll be analyzing. We’re at the top of a wooden scaffolding in the Udun region. Shadow of Mordor has similar mechanics to games like Assassin’s Creed where you can climb buildings and towers and enjoy some beautiful digital scenery from them.

Depth Prepass

The first ~140 draw calls perform a quick prepass to render the biggest elements of the terrain and buildings into the depth buffer. Most things don’t end up appearing in this prepass, but it helps when you’ve got a very big number of draw calls and a far range of view. Interestingly the character, who is always in front and takes a decent amount of screen space, does not go into the prepass. As is common for many open world games, the game employs reverse z, a technique that maps the near plane to 1.0 and far plane to 0.0 for increased precision at great distances and to prevent z-fighting. You can read more about z-buffer precision here.

 

G-buffer

Right after that, the G-Buffer pass begins, with around 2700 draw calls. If you've read my previous analysis of Castlevania: Lords of Shadow 2 or other similar articles, you'll be familiar with this pass. Surface properties are written to a set of buffers that are read later on by the lighting passes to compute the surface's response to light. Shadow of Mordor uses a classical deferred renderer, but with a comparatively small number of G-Buffer render targets (3). Just for comparison, Unreal Engine uses between 5 and 6 buffers in this pass. The G-Buffer layout is as follows:

Normals Buffer
R         G         B         A
Normal.x  Normal.y  Normal.z  ID

The normals buffer stores the normals in world space, in 8-bit per channel format. This is a little bit tight, sometimes not enough to accurately represent smoothly varying flat surfaces, as can be seen in some puddles throughout the game if paying close attention. The alpha channel is used as an ID that marks different types of objects. Some that I’ve found correspond to a character (255), an animated plant or flag (128), and the sky is marked with ID 1, as it’s later used to filter it out during the bloom phase (it gets its own radial bloom).

[Images: World Space Normals · Object ID]

Continue reading

Photoshop Blend Modes Without Backbuffer Copy

For the past couple of weeks, I have been trying to replicate the Photoshop blend modes in Unity. It is no easy task; despite the advances of modern graphics hardware, the blend unit still resists being programmable and will probably remain fixed for some time. Some OpenGL ES extensions implement this functionality, but most hardware and APIs don’t. So what options do we have?

1) Backbuffer copy

A common approach is to copy the entire backbuffer before doing the blending. This is what Unity does. After that it’s trivial to implement any blending you want in shader code. The obvious problem with this approach is that you need to do a full backbuffer copy before you do the blending operation. There are certainly some possible optimizations like only copying what you need to a smaller texture of some sort, but it gets complicated once you have many objects using blend modes. You can also do just a single backbuffer copy and re-use it, but then you can’t stack different blended objects on top of each other. In Unity, this is done via a GrabPass. It is the approach used by the Blend Modes plugin.

2) Leveraging the Blend Unit

Modern GPUs have a little unit at the end of the graphics pipeline called the Output Merger. It's the hardware responsible for taking the output of a pixel shader and blending it with the backbuffer. It's not programmable, as making it so has quite a lot of complications (you can read about it here), so current GPUs don't have a programmable blend unit.

The blend mode formulas were obtained here and here. Use them as a reference to compare with what I provide. There are many other sources. One thing I've noticed is that the provided formulas often neglect to mention that Photoshop actually uses modified formulas and clamps quantities in a different manner, especially when dealing with alpha. Gimp does the same. This is my experience recreating the Photoshop blend modes exclusively using a combination of the blend unit and shaders. The first few blend modes are simple, but as we progress we'll have to resort to more and more tricks to get what we want.

Two caveats before we start. First off, Photoshop blend modes do their blending in sRGB space, which means if you do them in linear space they will look wrong. Generally this isn’t a problem, but due to the amount of trickery we’ll be doing for these blend modes, many of the values need to go beyond the 0 – 1 range, which means we need an HDR buffer to do the calculations. Unity can do this by setting the camera to be HDR in the camera settings, and also setting Gamma for the color space in the Player Settings. This is clearly undesirable if you do your lighting calculations in linear space. In a custom engine you would probably be able to set this up in a different manner (to allow for linear lighting).

If you want to try the code out while you read ahead, download it here.

A) Darken

Formula:        min(SrcColor, DstColor)
Shader Output:
Blend Unit:     Min(SrcColor · One, DstColor · One)

darken

As alpha approaches 0, we need the minimum to tend towards DstColor, which we achieve by pushing SrcColor towards the maximum possible color, float3(1, 1, 1).
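
Putting that together, a hedged sketch of the Darken setup could look like this, with the blend state configured for a Min operation and One/One factors as in the table above:

float4 DarkenPS(float4 color : COLOR0) : SV_Target0
{
    // With alpha = 1 this outputs the source color; with alpha = 0 it outputs
    // white, so min(output, dest) == dest and the layer has no effect
    float3 output = lerp(float3(1.0, 1.0, 1.0), color.rgb, color.a);
    return float4(output, color.a);
}
// Blend state (e.g. in Unity's ShaderLab): BlendOp Min, Blend One One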

B) Multiply

Formula:        SrcColor · DstColor
Shader Output:
Blend Unit:     SrcColor · DstColor + DstColor · OneMinusSrcAlpha

multiply

Continue reading

The Rendering of Castlevania: Lords of Shadow 2

Castlevania: Lords of Shadow 2 was released in 2014, a sequel that builds on top of Lords of Shadow, its first installment, which uses a similar engine. I hold these games dear and, being Spanish myself, I'm very proud of the work MercurySteam, a team from Madrid, did on all three modern reinterpretations of the Castlevania series (Lords of Shadow, Mirror of Fate and Lords of Shadow 2). Out of curiosity and pure fandom for the game I decided to peek into the Mercury Engine. Despite the first Lords of Shadow being, without shadow of a doubt (no pun intended), the best and most enjoyable of the new Castlevanias, out of justice to their hard work I decided to analyze a frame from their latest and most polished version of the engine. Despite being a recent game, it uses DX9 as its graphics backend. Many popular tools like RenderDoc or the newest tools by Nvidia and AMD don't support DX9, so I used Intel Graphics Analyzer to capture and analyze all the images and code in this post. While there is a bit of graphics parlance, I've tried to include as many images as possible, with occasional code and in-depth explanations.

Analyzing a Frame

This is the frame we’re going to be looking at. It’s the beginning scene of Lords of Shadow 2, Dracula has just awakened, enemies are knocking at his door and he is not in the best mood.

CLOS2 Castle Final Frame

Depth Pre-pass

LoS2 appears to do what is called a depth pre-pass. This means the geometry is sent once through the pipeline with very simple shaders, pre-emptively populating the depth buffer. This is useful for the next pass (GBuffer) as it attempts to avoid overdraw: pixels with a depth value higher than the one already in the buffer (essentially, pixels that are behind something) get discarded before they run the pixel shader, minimizing pixel shader invocations at the cost of extra geometry processing. Alpha tested geometry, like hair and a rug with holes, is also included in the pre-pass. LoS2 uses both the standard depth buffer and a depth-as-color buffer to be able to sample depth as a texture in a later stage.

The game also takes the opportunity to fill in the stencil buffer, an auxiliary buffer that is part of the depth buffer and generally contains masks for pixel selection. I haven't thoroughly investigated why precisely all these elements are marked, but for instance wax presents stronger subsurface scattering, and hair and skin have their own shading, independent of the main lighting pass, which the stencil allows those passes to select or skip.

  • Dracula: 85
  • Hair, skin and leather: 86
  • Window glass/blood/dripping wax: 133
  • Candles: 21

The first image below shows what the overdraw is like for this scene. A depth pre-pass helps if you have a lot of overdraw. The second image is the stencil buffer.

[Images: Depth Prepass Overdraw · Stencil]

GBuffer Pass

LoS2 uses a deferred pipeline, fully populating 4 G-Buffers. 4 buffers is quite big for a game that was released on Xbox 360 and PS3; other games get away with 3 by using several optimizations.

Normals (in World Space):

normal.r  normal.g  normal.b  sss

The normal buffer is populated with the three components of the world space normal and a subsurface scattering term for hair and wax (interestingly not skin). Opaque objects only transform their normal from tangent space to world space, but hair uses some form of normal shifting to give it anisotropic properties.

[Images: Normals RGB (World) · Normal SSS]

Albedo:

albedo.r  albedo.g  albedo.b  alpha * AOLevels

The albedo buffer stores all three albedo components plus an ambient occlusion term that is stored per vertex in the alpha channel of the vertex color and is modulated by an AO constant (which I presume depends on the general lighting of the scene).

[Images: Albedo RGB · Albedo AO]

Specular:

specular.r  specular.g  specular.b  Fresnel multiplier

The specular buffer stores the specular color multiplied by a Fresnel term that depends on the view and normal vectors. Although LoS2 does not use physically-based rendering, it includes a Fresnel term, probably inspired in part by the Schlick approximation, to try and brighten things up at glancing angles. It is not strictly correct, as it is computed independently of the real-time lights. The Fresnel factor is also stored on its own in the w component.
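
For reference, the Schlick approximation alluded to here looks like the snippet below; exactly how the game evaluates and stores it is not known, so treat this as a generic sketch.

float SchlickFresnel(float3 normal, float3 viewDir, float F0)
{
    float cosTheta = saturate(dot(normal, viewDir));
    return F0 + (1.0 - F0) * pow(1.0 - cosTheta, 5.0);
}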

[Images: Specular RGB · Specular Fresnel Multiplier]

Continue reading

Replay system using Unity

Unity is an incredible tool for making quality games at a blazing fast pace. However, like all closed systems, it has some limitations on how you can extend the engine, and one such limitation is developing a good replay system for a game. I will talk about two possible approaches and how to solve other issues along the way. The system was devised for a Match 3 prototype, but can be applied to any project. There are commercial solutions available, but this post is intended for coders.

If the game has been designed with a deterministic outcome in mind, the most essential things to record are input events, delta times (optional) and random seeds. What this means is that the only external input to the game will be the players' actions; the rest should be simulated properly to arrive at the same outcome. Since storing random seeds and loading them as appropriate is more or less straightforward, we will focus on the input.

1) The first issue is how to capture and replay input in the least disturbing way possible. If you have used Unity’s Input class, Input.GetMouseButton(i) should look familiar. The replay system was added after developing the main mechanics, and we didn’t want to go back and rewrite how the game worked. Such a system should ideally work for future or existing games, and Unity already provides a nice interface that other programmers use. Furthermore, plugins use this interface, and not sticking to it can severely limit your ability to record games.

The solution we arrived at was shadowing Unity’s Input class by creating a new class with the same name and accessing it through the UnityEngine namespace inside of the new class. This allows for conditional routing of Unity’s input, therefore passing recorded values into the Input.GetMouseButtonX functions, and essentially ‘tricking’ the game into thinking it is playing real player input. You can do the same with keys.

There are many functions and properties to override, it can take time and care to get it all working properly. Once you have this new layer you can create a RecordManager class and start creating methods that connect with the new Input class.

2) The second issue is trickier to get working properly, due to common misconceptions (mine included) about how Unity's update loops work. Unity has two different update loops that serve different purposes, Update and FixedUpdate. Update runs every frame, whereas FixedUpdate runs at a fixed, specified time interval. FixedUpdate has absolutely nothing to do with Update: no rule says that for every Update there should be a FixedUpdate, or that there should be no more than one per Update.

Let's explain it with three use cases. For all of them, the FixedUpdate interval is 0.017 s (~60 fps).

a) Update runs at 60 fps (same as FixedUpdate). The order of updates would be:

b) Update runs faster (120 fps). I have chosen this number because it is exactly double that of FixedUpdate. In this case, the order of updates would be as follows:
There is one FixedUpdate every two Updates.

c) Update runs slower (30 fps). Same rule as above, but 30 = 60/2

Since FixedUpdate can’t keep up with Update, it updates twice to compensate.

This brings up the following question: where should I record input events, and where should I replay them? How can I replay something I recorded on one computer on another, and have the same output?

The answer to the first question is to record in Update. It is guaranteed to run on every Unity tick, and doing so in FixedUpdate would cause you to miss input events and mess up your recording. The answer to the second question is a little more open, and depends on how you recorded your data.

One approach is to record the deltaTime in Update for every Update, and shadow Unity’s Time class the same way we did with Input to be able to read a recorded Time.deltaTime property wherever it’s used. This has two possible issues, namely precision (of the deltaTime) and storage.

The second approach is to save events and link them to their corresponding FixedUpdate tick; that way you can associate many events with a single tick (if Update goes too fast) or none (if Update goes too slow). With this approach you can only execute your code in FixedUpdate, executing it as many times as there are recorded Updates. It's also important to save the average Update time of the original recording and set it as the FixedUpdate interval. The simulation will not be 100% accurate, in that Update times won't fluctuate as they did in the original recording session, but it is guaranteed to execute the same code.

ScriptExecutionOrder

There is one last setting that's needed to properly record events, which is to make the RecordManager record all input at the beginning of every frame. Unity has a Script Execution Order option under Project Settings where you can set the RecordManager to run before any other script. That way recording and replaying are guaranteed to run before the rest of the game logic.