After an impressive demo last year and its recent release in the UE5 preview, Nanite is all the rage these days. I just had to dive in and have some fun trying to figure it out, using a RenderDoc capture to explain how I think it operates and the technical decisions behind it. Props to Epic for being open with their tech, which makes it easier to learn and pick apart; the editor has markers and debug information that are going to be super helpful.
This is the frame we’re going to be looking at, from the epic showdown in the Valley of the Ancient demo project. It shows the interaction between Nanite and non-Nanite geometry and it’s just plain badass.
The first stage in this process is Nanite::CullRasterize, and it looks like this. In a nutshell, this entire pass is responsible for culling instances and triangles and rasterizing them. We’ll refer to it as we go through the capture.
Instance culling is one of the first things that happens here. It looks to be a GPU form of frustum and occlusion culling. Instance data and primitive data are bound here, which I guess means it culls at the instance level first and, if the instance survives, moves on to culling at a finer-grained level. The Nanite.Views buffer provides camera info for frustum culling, and a hierarchical depth buffer (HZB) is used for occlusion culling.
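To make the two tests concrete, here is a minimal CPU-side sketch in Python. The bounding sphere, the plane format and the `hzb_farthest_depth` field are assumptions made for illustration, as is the conventional depth direction (smaller means closer); the capture does not show how the shader represents bounds or samples the HZB.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    center: tuple              # bounding sphere center (x, y, z)
    radius: float              # bounding sphere radius
    nearest_depth: float       # depth of the closest point, smaller = closer
    hzb_farthest_depth: float  # farthest occluder depth under the screen footprint

def in_frustum(inst, planes):
    """planes are (nx, ny, nz, d) with normals pointing into the frustum."""
    x, y, z = inst.center
    return all(nx * x + ny * y + nz * z + d >= -inst.radius
               for nx, ny, nz, d in planes)

def is_occluded(inst):
    """Occluded when even the nearest point lies behind every occluder texel."""
    return inst.nearest_depth > inst.hzb_farthest_depth

def cull_instances(instances, planes):
    """Return the indices of instances that survive both tests."""
    return [i for i, inst in enumerate(instances)
            if in_frustum(inst, planes) and not is_occluded(inst)]
```

Only survivors of this coarse test would move on to the finer-grained culling.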
The HZB is sourced from the previous frame and forward projected to this one. I’m not sure how it deals with dynamic objects; it may be that it uses such a large mip (small resolution) that it is conservative enough. EDIT: According to the Nanite paper, the HZB is generated this frame with the previous frame’s visible objects. Both the previous objects and anything new are then tested against the HZB, and visibility is updated for the next frame.
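The scheme described in the edit can be sketched with a toy model. Everything here is hypothetical scaffolding: each object covers a single screen "column" at a single depth (smaller means closer), and the HZB is just a dictionary of the closest depth per column. It only illustrates the order of operations, not Nanite's actual data structures.

```python
from collections import namedtuple

Obj = namedtuple("Obj", "name column depth")  # toy object: one column, one depth

def build_hzb(drawn):
    """Closest depth drawn so far in each column."""
    hzb = {}
    for o in drawn:
        hzb[o.column] = min(o.depth, hzb.get(o.column, float("inf")))
    return hzb

def passes(o, hzb):
    """An object passes if nothing closer has been drawn in its column."""
    return o.depth <= hzb.get(o.column, float("inf"))

def two_phase_cull(objects, prev_visible):
    # Phase 1: draw what was visible last frame and build the HZB from it.
    main = [o for o in objects if o.name in prev_visible]
    hzb = build_hzb(main)
    # Phase 2: test everything else against that fresh HZB.
    post = [o for o in objects if o.name not in prev_visible and passes(o, hzb)]
    # Refresh the visible set for the next frame against the final depth.
    final_hzb = build_hzb(main + post)
    next_visible = {o.name for o in main + post if passes(o, final_hzb)}
    return main, post, next_visible
```

Dynamic objects are handled naturally in this model: anything that was hidden last frame still gets a second chance against depth produced this frame.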
Both visible and non-visible instances are written into buffers. For the latter I’m thinking this is the way of doing what occlusion queries used to do in the standard mesh pipeline: inform the CPU that a certain entity is occluded and it should stop processing until it becomes visible. The visible instances are also written out into a list of candidates.
Persistent culling seems to be related to streaming. It runs a fixed number of compute threads, suggesting it is unrelated to the complexity of the scene and instead maybe checks some spatial structure for occlusion. This is one complicated shader, but based on the inputs and outputs we can see it writes out how many triangle clusters of each type (compute and traditional raster) are visible into a buffer called MainRasterizeArgsSWHW (SW: compute, HW: raster).
Clustering and LODding
It’s worth mentioning LODs at this point as it is probably around here where those decisions are made. Some people speculated geometry images as a way to do continuous LODding but I see no indication of this. Triangles are grouped into patches called clusters, and some amount of culling is done at the cluster level. The clustering technique has been described before in papers by Ubisoft and Frostbite. For LODs, clusters start appearing and disappearing as the level of detail descends within instances. Some very clever magical incantations are employed here that ensure all the combinations of clusters stitch into each other seamlessly.
There seem to be two forms of rasterization present in the capture: compute and traditional draw-based. The previous buffer contained the arguments for two indirect executes that run these drawcalls.
- Render 3333 instances of 384 vertices each
- Run 34821 groups of this compute shader
The first drawcall uses traditional hardware-based rasterization. The criteria for choosing one or the other are unclear, but if I had to guess they would be related to the size of the triangles relative to the size of the pixels. Epic has mentioned before that a compute rasterizer can outperform hardware in specific scenarios whereas in others the hardware has an edge. These scenarios relate to how the hardware chokes on very small triangles as it’s unable to schedule them efficiently, hurting occupancy and performance. I can find several instances of large triangles, but it’s hard to tell by just looking at it.
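Purely to illustrate that guess, a size-based selection could look like the sketch below. The function names and the 16-pixel threshold are invented for the example; nothing in the capture confirms the actual criterion or any threshold value.

```python
import math

def projected_size_pixels(world_size, view_depth, fov_y_radians, viewport_height):
    """Approximate on-screen size of a world-space length at a given view depth."""
    visible_height = 2.0 * view_depth * math.tan(fov_y_radians / 2.0)
    return world_size / visible_height * viewport_height

def pick_rasterizer(edge_pixels, threshold_pixels=16.0):
    """Hypothetical rule: small triangles go to compute, large ones to hardware."""
    # threshold_pixels is a made-up number, not a value read from the capture.
    return "SW" if edge_pixels < threshold_pixels else "HW"
```

With a 90 degree vertical field of view and a 720 pixel tall viewport, a 1 cm edge seen from 10 m covers well under a pixel and would take the compute path under this rule, while a 5 m edge at the same distance would go to the hardware.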
The information above also gives us an insight into cluster size (384 vertices, i.e. 128 triangles), a suspicious multiple of 32 and 64 that is generally chosen to efficiently fill the wavefronts on a GPU. So 3333 clusters are rendered using the hardware, and the dispatch then takes care of the rest of the Nanite geometry. Each group is 128 threads, so my assumption is that each thread processes a triangle (as each cluster is 128 triangles). A whopping ~5 million triangles! These numbers tell us over 90% of the geometry is software rasterized, a confirmation of what Brian Karis said here. For shadows the same process is followed, except at the end only depth is output.
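The arithmetic behind those figures is easy to check. The only assumptions in this sketch are that every cluster is completely full (128 triangles) and that each compute group maps to exactly one cluster, so the totals are upper bounds.

```python
VERTICES_PER_CLUSTER = 384
TRIANGLES_PER_CLUSTER = VERTICES_PER_CLUSTER // 3   # 128

hw_clusters = 3333    # instances in the indirect draw
sw_clusters = 34821   # groups in the indirect dispatch

hw_triangles = hw_clusters * TRIANGLES_PER_CLUSTER
sw_triangles = sw_clusters * TRIANGLES_PER_CLUSTER
total_triangles = hw_triangles + sw_triangles
sw_fraction = sw_clusters / (hw_clusters + sw_clusters)

print(hw_triangles, sw_triangles, total_triangles, round(sw_fraction * 100, 1))
# prints: 426624 4457088 4883712 91.3
```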
The above process is repeated for a subset of geometry in the Post Pass. The reason for this seems to be that Nanite creates a more up-to-date HZB (in BuildPreviousOccluderHZB) with this frame’s depth information up to that point, combines it with the ZPrepass information (that stage happened before Nanite began) and uses the result to do more up-to-date occlusion culling. I wonder if the selection criteria for what gets culled here is at the “edges” of the previous depth buffer to avoid popping artifacts, or on geometry that was not visible last frame. In any case, the output from the rasterization stage is a single texture that we’ll talk about next.
One of Nanite’s star features is the visibility buffer. It is an R32G32_UINT texture that contains triangle and depth information for each pixel. At this point no material information is present, so the first 32-bit integer is the data necessary to access the properties later. Visibility buffers are not a new idea and have been discussed before (for example here and here) but as far as I know no commercial game has shipped with one. If deferred rendering decouples materials from lighting, this idea decouples geometry from materials: every pixel/triangle’s material is evaluated exactly once, and no textures, buffers or resources are accessed for surfaces that are later occluded. The visibility buffer is encoded as follows:
| R [31:7] | R [6:0] | G |
| --- | --- | --- |
| ClusterID (25 bits) | Triangle ID (7 bits) | 32-bit Depth |
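As a sanity check on that layout, here is a small pack/unpack sketch. The bit positions follow the table; storing the depth as the raw bits of a 32-bit float is an assumption, since the table only says "32-bit Depth".

```python
import struct

TRIANGLE_BITS = 7
CLUSTER_BITS = 25

def pack_visibility(cluster_id, triangle_id, depth):
    """Return the (R, G) pair for one pixel of the visibility buffer."""
    assert 0 <= cluster_id < (1 << CLUSTER_BITS)
    assert 0 <= triangle_id < (1 << TRIANGLE_BITS)
    r = (cluster_id << TRIANGLE_BITS) | triangle_id
    # Assumption: depth is stored as the raw bit pattern of a 32-bit float.
    g = struct.unpack("<I", struct.pack("<f", depth))[0]
    return r, g

def unpack_visibility(r, g):
    """Recover (cluster_id, triangle_id, depth) from an (R, G) pair."""
    cluster_id = r >> TRIANGLE_BITS
    triangle_id = r & ((1 << TRIANGLE_BITS) - 1)
    depth = struct.unpack("<f", struct.pack("<I", g))[0]
    return cluster_id, triangle_id, depth
```

The 7-bit triangle index lines up with the 128-triangle cluster size seen earlier in the capture.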
There is an upper limit of ~4 billion (2^32) triangles, which I would have said is plenty in another time; now I’m not so sure anymore. One thing I have found very interesting here is how limited the information is. Other visibility buffer proposals suggested storing barycentric coordinates. Here, everything is derived later by intersecting the triangle with the camera ray, reading the data from the original buffer, and recomputing/interpolating vertex quantities on the fly. This is described here in detail. As a final note, the crevice behind where the character is supposed to be standing is a remarkable illustration of the culling efficiency of the system.
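One common way to recover barycentrics from a camera ray is a Möller-Trumbore style intersection, sketched below. This is a generic textbook formulation used for illustration, not code taken from the engine, and the helper names are mine.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def barycentrics_from_ray(origin, direction, v0, v1, v2):
    """Barycentric weights (w0, w1, w2) of the ray/triangle-plane hit, or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < 1e-12:          # ray is parallel to the triangle plane
        return None
    t = sub(origin, v0)
    u = dot(t, p) / det
    v = dot(direction, cross(t, e1)) / det
    return (1.0 - u - v, u, v)

def interpolate(weights, a0, a1, a2):
    """Interpolate a per-vertex attribute (tuples of equal length)."""
    w0, w1, w2 = weights
    return tuple(w0 * x + w1 * y + w2 * z for x, y, z in zip(a0, a1, a2))
```

Given the three vertex positions fetched through the cluster and triangle IDs, the same weights can interpolate UVs, normals or any other vertex attribute.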
This phase outputs three important quantities: depth, motion vectors and ‘material depths’. The first two are standard quantities that are later used for things like TAA, reflections, etc. There is an interesting texture called the Nanite Mask that just indicates where Nanite geometry was rendered. Other than that, this is what they look like:
However, by far the most interesting texture output by this phase is the Material Depth. This is essentially a material ID turned into a unique depth value and stored in a depth-stencil target. Effectively, there is one shade of grey per material. This is going to be used next as an optimization that takes advantage of Early Z.
Hopefully by now we have a good understanding of the geometry pipeline. Up to now, we haven’t talked about materials at all. This is quite interesting because between the visibility buffer generation and now, the frame actually spends a lot of time doing other things (light grid, sky atmosphere, etc.) and also renders the GBuffer as it would normally. This really drives home the separation between geometry and materials that the Visibility Buffer aims for. The important steps are inside Classify Materials and Emit GBuffer.
The material classification pass runs a compute shader that analyzes the fullscreen visibility buffer. This is very important for the next pass. The output of the process is a 20×12 (= 240 pixel) R32G32_UINT texture called Material Range that encodes the range of materials present in the 64×64 region represented by each tile. It looks like this when viewed as a color texture.
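A CPU sketch of what such a classification could compute is shown below. Storing a (min, max) pair per tile is my reading of "range of materials" and of the two-channel format; the exact encoding is not visible in the capture. A 20×12 grid of 64×64 tiles is what a 1280×720 target would produce, which is also an inference on my part.

```python
import math

def build_material_range(material_ids, tile_size=64):
    """material_ids is a list of rows; returns a grid of (min_id, max_id) per tile."""
    height, width = len(material_ids), len(material_ids[0])
    tiles_x = math.ceil(width / tile_size)
    tiles_y = math.ceil(height / tile_size)
    ranges = []
    for ty in range(tiles_y):
        row = []
        for tx in range(tiles_x):
            # Gather every material ID inside this tile, clamped to the image.
            ids = [material_ids[y][x]
                   for y in range(ty * tile_size, min((ty + 1) * tile_size, height))
                   for x in range(tx * tile_size, min((tx + 1) * tile_size, width))]
            row.append((min(ids), max(ids)))
        ranges.append(row)
    return ranges
```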
We have finally reached the point where visibility meets materials, the point that the visibility buffer is all about: turning triangle information into surface properties. Unreal allows users to assign arbitrary materials to surfaces, so how do we efficiently manage that complexity? This is what Emit GBuffer looks like.
We have what looks like a drawcall per material ID, and every drawcall is a fullscreen quad chopped up into 240 squares rendered across the screen. One fullscreen drawcall per material? Have they gone mad? Not quite. We mentioned before that the material range texture was 240 pixels, so every quad of this fullscreen drawcall has a corresponding texel. The quad vertices sample this texture and check whether the tile is relevant to them, i.e. whether any pixel in the tile has the material they are going to render. If not, the x coordinate is set to NaN and the whole quad is discarded, which is a well-defined operation.
As far as I can tell, the system uses 14 bits for material IDs, for a total of 16384 maximum materials. A constant buffer sends the material ID to the vertex shader so that it can check whether it’s in range.
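The per-vertex check can be modelled in a few lines. This is a sketch of the logic as I understand it from the capture, with a hypothetical function name and a (min, max) tuple standing in for the sampled Material Range texel.

```python
import math

MATERIAL_ID_BITS = 14
MAX_MATERIALS = 1 << MATERIAL_ID_BITS   # 16384

def tile_vertex_x(x, material_id, tile_range):
    """Return the vertex x unchanged if the tile may contain the material, else NaN."""
    lo, hi = tile_range
    if lo <= material_id <= hi:
        return x
    # A NaN position makes the rasterizer drop the whole quad.
    return float("nan")
```

For a tile whose range is (2, 9), `tile_vertex_x(0.25, 12, (2, 9))` returns NaN, so the quad for material 12 never reaches the pixel shader in that tile.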
On top of that, let’s remember that we created a material depth texture where every material ID is set to be a certain depth. These quads are output at the depth represented by their material and the depth test is set to equal, so the hardware can very quickly discard any pixels that aren’t relevant. As an extra step, the engine has previously marked in the stencil buffer which pixels have Nanite geometry and which are regular pixels, which is also used for Early Stencil optimizations. To see what this all means let’s look at the albedo buffer.
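The depth-equal trick can be mimicked on the CPU as follows. The mapping from material ID to depth used here (ID divided by the largest 14-bit ID) is a placeholder; the capture does not reveal the real encoding, only that each material gets a unique depth.

```python
MAX_MATERIAL_ID = (1 << 14) - 1

def material_to_depth(material_id):
    # Hypothetical encoding: any mapping works as long as it is unique per ID.
    return material_id / MAX_MATERIAL_ID

def shade_material(material_depth_buffer, nanite_stencil, material_id):
    """Indices of pixels that pass the stencil test and the depth-equal test."""
    quad_depth = material_to_depth(material_id)
    return [i for i, (depth, is_nanite)
            in enumerate(zip(material_depth_buffer, nanite_stencil))
            if is_nanite and depth == quad_depth]
```

Because the quad and the buffer use the same encoding, the equality is exact, and every pixel belonging to another material is rejected before any shading work happens.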
You may have noticed some of the quads are completely red, which I would have thought would be completely discarded by the vertex shader. However I think the material range texture is exactly what it says, a range of materials covered by that tile. If a material happens to be “in the middle” but none of the pixels have it, it will be considered as a candidate even though the depth test will discard it totally later. In any case that’s the main idea, the same process as shown in the images is repeated until all materials are processed. The final GBuffer is shown as tiles below.
Nanite ends at this point, and the rest of the pipeline carries on as normal thanks to the material decoupling deferred rendering offers. The work that has been put into this is truly remarkable. I’m sure there are a ton of details I am not aware of and imprecisions regarding how it works so I’m really looking forward to seeing what Brian Karis has to say about it at his SIGGRAPH deep dive this year.