A Macro View of Nanite

After showing an impressive demo last year and unleashing recently with the UE5 preview, Nanite is all the rage these days. I just had to go in and have some fun trying to figure it out and explain how I think it operates and the technical decisions behind it using a renderdoc capture. Props to Epic for being open with their tech which makes it easier to learn and pick apart; the editor has markers and debug information that are going to be super helpful.

This is the frame we’re going to be looking at, from the epic showdown in the Valley of the Ancient demo project. It shows the interaction between Nanite and non-Nanite geometry and it’s just plain badass.

Nanite::CullRasterize

The first stage in this process is Nanite::CullRasterize, and it looks like this. In a nutshell, this entire pass is responsible for culling instances and triangles and rasterizing them. We’ll refer to it as we go through the capture.

Instance Culling

Instance culling is one of the first things that happens here. It looks to be a GPU form of frustum and occlusion culling. There is instance data and primitive data bound here, I guess it means it culls at the instance level first, and if the instance survives it starts culling at a finer-grained level. The Nanite.Views buffer provides camera info for frustum culling, and hierarchical depth buffer (HZB) is used for occlusion culling. The HZB is sourced from the previous frame and forward projected to this one. I’m not sure how it deals with dynamic objects, it may be that it uses such a large mip (small resolution) that it is conservative enough.

Both visible and non-visible instances are written into buffers. For the latter I’m thinking this is the way of doing what occlusion queries used to do in the standard mesh pipeline: inform the CPU that a certain entity is occluded and it should stop processing until it becomes visible. The visible instances are also written out into a list of candidates.

Persistent Culling

Persistent culling seems to be related to streaming. It is a fixed number of compute threads, suggesting it is unrelated to the complexity of the scene and instead maybe checks some spatial structure for occlusion. This is one complicated shader, but based on the inputs and outputs we can see it writes out how many triangle clusters are visible of each type (compute and traditional raster) into a buffer called MainRasterizeArgsSWHW (SW:compute, HW:raster).

Clustering and LODding

It’s worth mentioning LODs at this point as it is probably around here where those decisions are made. Some people speculated geometry images as a way to do continuous LODding but I see no indication of this. Triangles are grouped into patches called clusters, and some amount of culling is done at the cluster level. The clustering technique has been described before in papers by Ubisoft and Frostbite. For LODs, clusters start appearing and disappearing as the level of detail descends within instances. Some very clever magical incantations are employed here that ensure all the combinations of clusters stitch into each other seamlessly.

Continue reading

The Rendering of Jurassic World: Evolution

Jurassic World: Evolution is the kind of game many kids (and adult-kids) dreamed of for a long time. What’s not to like about a game that gives you the reins of a park where the main attractions are 65-million-year-old colossal beasts? This isn’t the first successful amusement park game by Frontier Developments, but it’s certainly not your typical one. Frontier is a proud developer of their Cobra technology, which has been evolving since 1988. For JWE in particular it is a DX11 tiled forward renderer. For the analysis I used Renderdoc and turned on all the graphics bells and whistles. Welcome… to Jurassic Park.

The Frame

It’s hard to decide to present as a frame for this game, because free navigation and dynamic time of day means you have limitless possibilities, from a bird’s eye view to an extreme closeup of the dinosaurs, a sunset, a bright day or a hurricane. I chose a moody, rainy intermediate view that captures the dark essence of the original movies taking advantage of the Capture Mode introduced in version 1.7.

Compute Shaders

The first thing to notice about the frame is that it is very compute-heavy. In the absence of markers, Renderdoc splits rendering into passes if there are more than one Draw or Dispatch commands targeting the same output buffers. According to the capture there are 15 compute vs 18 color/depth passes, i.e. it is broadly split into half compute, half draw techniques. Compute can be more flexible than draw (and, if done correctly, faster) but a lot of time has to be spent fine-tuning and balancing performance. Frontier clearly spared no expense developing the technology to get there, however this also means that analyzing a frame is a bit harder.

Grass Displacement

A big component of JWE is foliage and its interaction with cars, dinosaurs, wind, etc. To animate the grass, one of the very first processes populates a top-down texture that contains grass displacement information. This grass displacement texture is later read in the vertex shader of all the grass in the game, and the information used to modify the position of the vertices of each blade of grass. The texture wraps around as the camera moves and fills in the new regions that appear at the edges. This means that the texture doesn’t necessarily look like a top-down snapshot of the scene, but will typically be split into 4 quadrants. The process involves these steps:

  1. Render dinosaurs and cars, probably other objects such as the gyrospheres. This doesn’t need an accurate version of the geometry, e.g. cars only render wheels and part of the chassis, which is in contact with grass. The result is a top down depth buffer (leftmost image). If you squint you’ll see the profile of an ankylosaurus. The other dinosaurs aren’t rendered here, perhaps the engine knows they aren’t stepping on grass and optimizes them out.
  2. Take this depth buffer and a heightmap of the scene (center image), and output three quantities: a mask to tell whether the depth of the object was above/below the terrain, the difference in depth between them, and the actual depth and pack them in a 3-channel texture (rightmost image)



An additional process simulates wind. In this particular scene there is a general breeze from the storm plus a helicopter, both producing currents that displace grass. This is a top down texture similar to the one before containing motion vectors in 2D. The motion for the wind is an undulating texture meant to mimic wind waves which seems to have been computed on the CPU, and the influence of the helicopter is cleverly done blending a stream of particles on top of the first texture. You can see it in the image as streams pulling outward. Dinosaur and car motion is also blended here. I’m not entirely sure what the purpose of the repeating texture is (you can see the same objects repeated multiple times).

Continue reading

Rendering Line Lights

Within the arsenal of lights provided by game engines, the most popular are punctual lights such as point, spot or directional because they are cheap. On the other end, area lights have recently produced incredible techniques such as Linearly Transformed Cosines and other analytic approximations. I want to talk about the line light.

Update [04/09/2020] When I originally wrote the article there were no public images showing Jedi or lightsabers so I couldn’t make the connection (though a clever reader could have concluded what they might be for!) I can finally show this work off as it’s meant to be.

In Unreal Engine 4, modifying ‘Source Length’ on a point light elongates it as described in this paper. It spreads the intensity along the length so a longer light becomes perceptually dimmer. Frostbite also has tube lights, a complex implementation of the analytical illuminance emitted by a cylinder and two spheres. Unity includes tube lights as well in their HD Render Pipeline (thanks Eric Heitz and Evegenii Golubev for pointing it out) based on their LTC theory, which you can find a great explanation and demos for here. Guerrilla Games’ Decima Engine has elongated quad lights using an approach for which they have a very attractive and thorough explanation in GPU Pro 5’s chapter II.1, Physically Based Area Lights. This is what I adapted to line lights.

Continue reading

The Rendering of Rise of the Tomb Raider

Rise of the Tomb Raider (2015) is the sequel to the excellent Tomb Raider (2013) reboot. I personally find both refreshing as they move away from the stagnating original series and retell the Croft story. The game is story focused and, like its prequel, offers enjoyable crafting, hunting and climbing/exploring mechanics.

Tomb Raider used the Crystal Engine, developed by Crystal Dynamics also used in Deus Ex: Human Revolution. For the sequel a new engine called Foundation was used, previously developed for Lara Croft and the Temple of Osiris (2014). Its rendering can be broadly classified as a tiled light-prepass engine, and we’ll see what that means as we dive in. The engine offers the choice between a DX11 and DX12 renderer; I chose the latter for reasons we’ll see later. I used Renderdoc 1.2 to capture the frame, on a Geforce 980 Ti, and turned on all the bells and whistles.

The Frame

I can safely say without spoilers that in this frame bad guys chase Lara because she’s looking for an artifact they’re looking for too, a conflict of interest that absolutely must be resolved using weapons. Lara is inside the enemy base at nighttime. I chose a frame with atmospheric and contrasty lighting where the engine can show off.

Depth Prepass

A customary optimization in many games, a small depth prepass takes place here (~100 draw calls). The game renders the biggest objects (rather the ones that take up the most screen space), to take advantage of the Early-Z capability of GPUs. A concise article by Intel explains further. In short, the GPU can avoid running a pixel shader if it can determine it’s occluded behind a previous pixel. It’s a relatively cheap pass that will pre-populate the Z-buffer with depth.

An interesting thing I found is a level of detail (LOD) technique called ‘fizzle’ or ‘checkerboard’. It’s a common way to fade objects in and out at a distance, either to later replace it with a lower quality mesh or to completely make it disappear. Take a look at this truck. It seems to be rendering twice, but in reality it’s rendering a high LOD and a low LOD at the same position, each rendering to the pixels the other is not rendering to. The first LOD is 182226 vertices, whereas the second LOD is 47250. They’re visually indistinguishable at a distance, and yet one is 3 times cheaper. In this frame, LOD 0 has almost disappeared while LOD 1 is almost fully rendered. Once LOD 0 completely disappears, only LOD 1 will render.

A pseudorandom texture and a probability factor allow us to discard pixels that don’t pass a threshold. You can see this texture used in ROTR. You might be asking yourself why not use alpha blending. There are many disadvantages to alpha blending over fizzle fading.

  1. Depth prepass-friendly: By rendering it like an opaque object and puncturing holes, we can still render into the prepass and take advantage of early-z. Alpha blended objects don’t render into the depth buffer this early due to sorting issues.
  2. Needs extra shader(s): If you have a deferred renderer, your opaque shader doesn’t do any lighting. You need a separate variant that does if you’re going to swap an opaque object for a transparent one. Aside from the memory/complexity cost of having at least an extra shader for all opaque objects, they need to be accurate to avoid popping. There are many reasons why this is hard, but it boils down to the fact they’re now rendering through a different code path.
  3. More overdraw: Alpha blending can produce more overdraw and depending on the complexity of your objects you might find yourself paying a large bandwidth cost for LOD fading.
  4. Z-fighting: z-fighting is the flickering effect when two polygons render to a very similar depth such that floating point imprecision causes them to “take turns” to render. If we render two consecutive LODs by fading one out and the next one in, they might z-fight since they’re so close together. There are ways around it like biasing one over the other but it gets tricky.
  5. Z-buffer effects: Many effects like SSAO rely on the depth buffer. If we render transparent objects at the end of the pipeline when ambient occlusion has run already, we won’t be able to factor them in.

One disadvantage of this technique is that it can look worse than alpha fading, but a good noise pattern, post-fizzle blurring or temporal AA can hide it to a large extent. ROTR doesn’t do anything fancy in this respect.

Normals Pass

Crystal Dynamics uses a relatively unusual lighting scheme for its games that we’ll describe in the lighting pass. For now suffice it to say that there is no G-Buffer pass, at least not in the sense that other games have us accustomed to. Instead, the objects in this pass only output depth and normals information. Normals are written to an RGBA16_SNORM render target in world space. As a curiosity, this engine uses Z-up as opposed to Y-up which is what I see more often in other engines/modelling packages. The alpha channel contains glossiness, which will be decompressed later as exp2(glossiness * 12 + 1.0). The glossiness value can actually be negative, as the sign is used as a flag to indicate whether a surface is metallic or not. You can almost spot it yourself, as the darker colors in the alpha channel are all metallic objects.

RGBA
Normal.xNormal.yNormal.zGlossiness + Metalness

Normals
Glossiness/Metalness
 
previous arrow
next arrow

 
Continue reading

A real life pinhole camera

When I got married last year, me and my wife went on our honeymoon to Thailand. Their king Bhumibol had died just a month ago and the whole country was mourning, so everywhere we found memorials and good wishes for their king, and people would dress in black and white as a sign of sorrow. The Thai are a gentle and polite people, who like to help out; we’d ask for directions and people with no notions of English would spend twenty minutes trying to understand and answer our questions. Thailand has a rich history of rising and falling kingdoms, great kings and battles, and unification and invasions by foreign kingdoms. There are some amazing ruins of these kingdoms. Thailand also lives by a variant of Buddhism reflected in all of their beautiful temples. Some of the architectural features I found most interesting are the small reflective tiles that cover the outer walls, animal motives like the Garuda, (bird creatures that can be seen on the rooftops) and snake-like creatures called Naga It is in this unexpected context that I found a real-life pinhole camera. I always wear my graphics hat so I decided to capture it and later make a post.

First, a little background. A pinhole camera (also known as camera obscura after its latin name) is essentially the simplest camera you can come up with. If you conceptually imagine a closed box that has a single, minuscule hole in one of its faces, such that a single ray from each direction can come inside, you’d have a mirrored image at the inner face of the other side of the box to where the pinhole is. An image is worth more than a thousand explanations, so here’s what I’m talking about.

 

Pinhole Diagram
Pinhole Diagram
Pinhole Diagram
Pinhole Diagram
 
previous arrow
next arrow

 

As you can see, the concept is simple. If you were inside the room, you’d see an inverted image of the outside. The hole is so small the room would be fairly dark so even the faint light now bouncing back towards you would still be visible. I made the pinhole a hexagon, as I wanted to suggest the fact that it is effectively the shutter of a modern camera. Louis Daguerre, one of the fathers of photography, used this model in his famous daguerreotype circa 1835, but Leonardo da Vinci had already described this phenomenon as an oculus artificialis (artificial eye) in one of his works in as early as 1502. There are plenty additional resources if you’re interested and even a pretty cool tutorial on how to create your own.

Now that we understand what this camera is, let’s look at the real image I encountered. I’ve aligned the inside and outside images I took and cast rays so you can see what I mean.

 

Real Pinhole Camera
Real Pinhole Camera
Real Pinhole Camera
Real Pinhole Camera
Real Pinhole Camera
 
previous arrow
next arrow

 

The image of the inside looks bright but I had to take it with 1 second of exposure and it still looks relatively dark. On top of that the day outside was very sunny which helped a lot in getting a clear “photograph”.

The Rendering of Middle Earth: Shadow of Mordor

Middle Earth: Shadow of Mordor was released in 2014. The game itself was a great surprise, and the fact that it was a spin-off within the storyline of the Lord of the Rings universe was quite unusual and it’s something I enjoyed. The game was a great success, and at the time of writing, Monolith has already released the sequel, Shadow of War. The game’s graphics are beautiful, especially considering it was a cross-generation game and was also released on Xbox 360 and PS3. The PC version is quite polished and features a few extra graphical options and hi-resolution texture packs that make it shine.

The game uses a relatively modern deferred DX11 renderer. I used Renderdoc to delve into the game’s rendering techniques. I used the highest possible graphical settings (ultra) and enabled all the bells and whistles like order-independent transparency, tessellation, screen-space occlusion and the different motion blurs.

The Frame

This is the frame we’ll be analyzing. We’re at the top of a wooden scaffolding in the Udun region. Shadow of Mordor has similar mechanics to games like Assassin’s Creed where you can climb buildings and towers and enjoy some beautiful digital scenery from them.

Depth Prepass

The first ~140 draw calls perform a quick prepass to render the biggest elements of the terrain and buildings into the depth buffer. Most things don’t end up appearing in this prepass, but it helps when you’ve got a very big number of draw calls and a far range of view. Interestingly the character, who is always in front and takes a decent amount of screen space, does not go into the prepass. As is common for many open world games, the game employs reverse z, a technique that maps the near plane to 1.0 and far plane to 0.0 for increased precision at great distances and to prevent z-fighting. You can read more about z-buffer precision here.

 

G-buffer

Right after that, the G-Buffer pass begins, with around ~2700 draw calls. If you’ve read my previous analysis for Castlevania: Lords of Shadow 2 or have read other similar articles, you’ll be familiar with this pass. Surface properties are written to a set of buffers that are read later on by lighting passes to compute its response to the light. Shadow of Mordor uses a classical deferred renderer, but uses a comparably small amount of G-buffer render targets (3) to achieve its objective. Just for comparison, Unreal Engine uses between 5 and 6 buffers in this pass. The G-buffer layout is as follows:

Normals Buffer
RGBA
Normal.xNormal.yNormal.zID

The normals buffer stores the normals in world space, in 8-bit per channel format. This is a little bit tight, sometimes not enough to accurately represent smoothly varying flat surfaces, as can be seen in some puddles throughout the game if paying close attention. The alpha channel is used as an ID that marks different types of objects. Some that I’ve found correspond to a character (255), an animated plant or flag (128), and the sky is marked with ID 1, as it’s later used to filter it out during the bloom phase (it gets its own radial bloom).

World Space Normals
Object ID
 
previous arrow
next arrow

Continue reading

Photoshop Blend Modes Without Backbuffer Copy

For the past couple of weeks, I have been trying to replicate the Photoshop blend modes in Unity. It is no easy task; despite the advances of modern graphics hardware, the blend unit still resists being programmable and will probably remain fixed for some time. Some OpenGL ES extensions implement this functionality, but most hardware and APIs don’t. So what options do we have?

1) Backbuffer copy

A common approach is to copy the entire backbuffer before doing the blending. This is what Unity does. After that it’s trivial to implement any blending you want in shader code. The obvious problem with this approach is that you need to do a full backbuffer copy before you do the blending operation. There are certainly some possible optimizations like only copying what you need to a smaller texture of some sort, but it gets complicated once you have many objects using blend modes. You can also do just a single backbuffer copy and re-use it, but then you can’t stack different blended objects on top of each other. In Unity, this is done via a GrabPass. It is the approach used by the Blend Modes plugin.

2) Leveraging the Blend Unit

Modern GPUs have a little unit at the end of the graphics pipeline called the Output Merger. It’s the hardware responsible for getting the output of a pixel shader and blending it with the backbuffer. It’s not programmable, as to do so has quite a lot of complications (you can read about it here) so current GPUs don’t have one.

The blend mode formulas were obtained here and here. Use it as reference to compare it with what I provide. There are many other sources. One thing I’ve noticed is that provided formulas often neglect to mention that Photoshop actually uses modified formulas and clamps quantities in a different manner, especially when dealing with alpha. Gimp does the same. This is my experience recreating the Photoshop blend modes exclusively using a combination of blend unit and shaders. The first few blend modes are simple, but as we progress we’ll have to resort to more and more tricks to get what we want.

Two caveats before we start. First off, Photoshop blend modes do their blending in sRGB space, which means if you do them in linear space they will look wrong. Generally this isn’t a problem, but due to the amount of trickery we’ll be doing for these blend modes, many of the values need to go beyond the 0 – 1 range, which means we need an HDR buffer to do the calculations. Unity can do this by setting the camera to be HDR in the camera settings, and also setting Gamma for the color space in the Player Settings. This is clearly undesirable if you do your lighting calculations in linear space. In a custom engine you would probably be able to set this up in a different manner (to allow for linear lighting).

If you want to try the code out while you read ahead, download it here.

A) Darken

Formulamin(SrcColor, DstColor)
Shader Output
Blend UnitMin(SrcColor · One, DstColor · One)

darken

As alpha approaches 0, we need to tend the minimum value to DstColor, by forcing SrcColor to be the maximum possible color float3(1, 1, 1)

B) Multiply

FormulaSrcColor · DstColor
Shader Output
Blend UnitSrcColor · DstColor + DstColor · OneMinusSrcAlpha

multiply

Continue reading

The Rendering of Castlevania: Lords of Shadow 2

Castlevania Lords of Shadow 2 was released in 2014, a sequel that builds on top of Lords of Shadow, its first installment, which uses a similar engine. I hold these games dear and, being Spanish myself, I’m very proud of the work MercurySteam, a team from Madrid, did on all three modern reinterpretations of the Castlevania series (Lords of Shadow, Mirror of Fate and Lords of Shadow 2). Out of curiosity and pure fandom for the game I decided to peek into the Mercury Engine. Despite the first Lords of Shadow being, without shadow of a doubt (no pun intended), the best and most enjoyable of the new Castlevanias, out of justice for their hard work I decided to analyze a frame from their latest and most polished version of the engine. Despite being a recent game, it uses DX9 as graphics backend. Many popular tools like RenderDoc or the newest tools by Nvidia and AMD don’t support DX9, so I used Intel Graphics Analyzer to capture and analyze all the images and code from this post. While having a bit of graphics parlance, I’ve tried to include as many images as possible, with occasional code and in-depth explanations.

Analyzing a Frame

This is the frame we’re going to be looking at. It’s the beginning scene of Lords of Shadow 2, Dracula has just awakened, enemies are knocking at his door and he is not in the best mood.

CLOS2 Castle Final Frame

Depth Pre-pass

LoS2 appears to do what is called a depth pre-pass. What it means is you send the geometry once through the pipeline with very simple shaders, and pre-emptively populate the depth buffer. This is useful for the next pass (Gbuffer), as it attempts to avoid overdraw, so pixels with a depth value higher than the one already in the buffer (essentially, pixels that are behind) get discarded before they run the pixel shader, therefore minimizing pixel shader runs at the cost of extra geometry processing. Alpha tested geometry, like hair and a rug with holes, are also included in the pre-pass. LoS2 uses both the standard depth buffer and a depth-as-color buffer to be able to sample the depth buffer as a texture in a later stage.

The game also takes the opportunity to fill in the stencil buffer, an auxiliary buffer that is part of the depth buffer, and generally contains masks for pixel selection. I haven’t thoroughly investigated why precisely all these elements are marked, but for instance was presents higher subsurface scattering and hair and skin have its own shading, independent of the main lighting pass, which stencil allows to ignore.

  • Dracula: 85
  • Hair, skin and leather: 86
  • Window glass/blood/dripping wax: 133
  • Candles: 21

The first image below shows what the overdraw is like for this scene. A depth pre-pass helps if you have a lot of overdraw. The second image is the stencil buffer.

Depth Prepass Overdraw
Stencil
previous arrow
next arrow
 

GBuffer Pass

LoS2 uses a deferred pipeline, fully populating 4 G-Buffers. 4 buffers is quite big for a game that was released on Xbox360 and PS3, other games get away with 3 by using several optimizations.

Normals (in World Space):

normal.rnormal.gnormal.bsss

The normal buffer is populated with the three components of the world space normal and a subsurface scattering term for hair and wax (interestingly not skin). Opaque objects only transform their normal from tangent space to world space, but hair uses some form of normal shifting to give it anisotropic properties.

Normals RGB (World)
Normal SSS
previous arrow
next arrow
 

Albedo:

albedo.ralbedo.galbedo.balpha * AOLevels

The albedo buffer stores all three albedo components plus an ambient occlusion term that is stored per vertex in the alpha channel of the vertex color and is modulated by an AO constant (which I presume depends on the general lighting of the scene).

Albedo RGB
Albedo AO
previous arrow
next arrow
 

Specular:

specular.rspecular.gspecular.bFresnel multiplier

The specular buffer stores the specular color multiplied by a fresnel term that depends on the view and normal vectors. Although LoS2 does not use physically-based rendering, it includes a Fresnel term probably inspired in part by the Schlick approximation to try and brighten things up at glancing angles. It is not strictly correct, as it is done independently of the real-time lights. The Fresnel factor is also stored in the w component.

Specular RGB
Specular Fresnel Multiplier
previous arrow
next arrow
 

Continue reading