Photoshop Blend Modes Without Backbuffer Copy

For the past couple of weeks, I have been trying to replicate the Photoshop blend modes in Unity. It is no easy task; despite the advances of modern graphics hardware, the blend unit still resists being programmable and will probably remain fixed for some time. Some OpenGL ES extensions implement this functionality, but most hardware and APIs don’t. So what options do we have?

1) Backbuffer copy

A common approach is to copy the entire backbuffer before doing the blending. This is what Unity does. After that it’s trivial to implement any blending you want in shader code. The obvious problem with this approach is that you need to do a full backbuffer copy before you do the blending operation. There are certainly some possible optimizations like only copying what you need to a smaller texture of some sort, but it gets complicated once you have many objects using blend modes. You can also do just a single backbuffer copy and re-use it, but then you can’t stack different blended objects on top of each other. In Unity, this is done via a GrabPass. It is the approach used by the Blend Modes plugin.

2) Leveraging the Blend Unit

Modern GPUs have a little unit at the end of the graphics pipeline called the Output Merger. It’s the hardware responsible for getting the output of a pixel shader and blending it with the backbuffer. It’s not programmable, as to do so has quite a lot of complications (you can read about it here) so current GPUs don’t have one.

The blend mode formulas were obtained here and here. Use it as reference to compare it with what I provide. There are many other sources. One thing I’ve noticed is that provided formulas often neglect to mention that Photoshop actually uses modified formulas and clamps quantities in a different manner, especially when dealing with alpha. Gimp does the same. This is my experience recreating the Photoshop blend modes exclusively using a combination of blend unit and shaders. The first few blend modes are simple, but as we progress we’ll have to resort to more and more tricks to get what we want.

Two caveats before we start. First off, Photoshop blend modes do their blending in sRGB space, which means if you do them in linear space they will look wrong. Generally this isn’t a problem, but due to the amount of trickery we’ll be doing for these blend modes, many of the values need to go beyond the 0 – 1 range, which means we need an HDR buffer to do the calculations. Unity can do this by setting the camera to be HDR in the camera settings, and also setting Gamma for the color space in the Player Settings. This is clearly undesirable if you do your lighting calculations in linear space. In a custom engine you would probably be able to set this up in a different manner (to allow for linear lighting).

If you want to try the code out while you read ahead, download it here.

A) Darken

Formula min(SrcColor, DstColor)
Shader Output
Blend Unit Min(SrcColor · One, DstColor · One)

darken

As alpha approaches 0, we need to tend the minimum value to DstColor, by forcing SrcColor to be the maximum possible color float3(1, 1, 1)

B) Multiply

Formula SrcColor · DstColor
Shader Output
Blend Unit SrcColor · DstColor + DstColor · OneMinusSrcAlpha

multiply

C) Color Burn

Formula 1 – (1 – DstColor) / SrcColor
Shader Output
Blend Unit SrcColor · One + DstColor · OneMinusSrcColor

color-burn

D) Linear Burn

Formula SrcColor + DstColor – 1
Shader Output
Blend Unit SrcColor · One + DstColor · One

linear-burn

E) Lighten

Formula Max(SrcColor, DstColor)
Shader Output
Blend Unit Max(SrcColor · One, DstColor · One)

lighten

F) Screen

Formula 1 – (1 – DstColor) · (1 – SrcColor) = Src + Dst – Src · Dst
Shader Output
Blend Unit SrcColor · One + DstColor · OneMinusSrcColor

screen

G) Color Dodge

Formula DstColor / (1 – SrcColor)
Shader Output
Blend Unit SrcColor · DstColor + DstColor · Zero

color-dodge

You can see discrepancies between the Photoshop and the Unity version in the alpha blending, especially at the edges.

H) Linear Dodge

Formula SrcColor + DstColor
Shader Output
Blend Unit SrcColor · SrcAlpha + DstColor · One

linear-dodge

This one also exhibits color “bleeding” at the edges. To be honest I prefer the one to the right just because it looks more “alive” than the other one. Same goes for Color Dodge. However this limits the 1-to-1 mapping to Photoshop/Gimp.

All of the previous blend modes have simple formulas and one way or another they can be implemented via a few instructions and the correct blending mode. However, some blend modes have conditional behavior or complex expressions (complex relative to the blend unit) that need a bit of re-thinking. Most of the blend modes that follow needed a two-pass approach (using the Pass syntax in your shader). Two-pass shaders in Unity have a limitation in that the two passes aren’t guaranteed to render one after the other for a given material. These blend modes rely on the previous pass, so you’ll get weird artifacts. If you have two overlapping sprites (as in a 2D game, such as our use case) the sorting will be undefined. The workaround around this is to move the Order in Layer property to force them to sort properly.

I) Overlay

Formula 1 – (1 – 2 · (DstColor – 0.5)) · (1 – SrcColor), if DstColor > 0.5
2 · DstColor · SrcColor, if DstColor <= 0.5
Shader Pass 1
Blend Pass 1 SrcColor · DstColor + DstColor · DstColor
Shader Pass 2
Blend Pass 2 SrcColor · DstColor + DstColor · Zero

overlay

How I ended up with Overlay requires an explanation. We take the original formula and approximate via a linear blend:

{ 2 \cdot Dst \cdot Src \cdot (1 - Dst) + (1 - 2 \cdot (1 - Dst) \cdot (1 - Src)) \cdot Dst}

We simplify as much as we can and end up with this

{ (4 \cdot Src - 1) \cdot Dst + (2 - 4 \cdot Src) \cdot Dst \cdot Dst }

The only way I found to get DstColor · DstColor is to isolate the term and do it in two passes, therefore we extract the same factor in both sides:

{ \Big[(4 \cdot Src - 1) \cdot \frac {Dst} {(2 - 4 \cdot Src)} + Dst \cdot Dst\Big] \cdot (2 - 4 \cdot Src) }

However this formula doesn’t take alpha into account. We still need to linearly interpolate this big formula with alpha, where an alpha of 0 should return Dst. Therefore

{ \Big[(4 \cdot Src - 1) \cdot \frac {Dst} {(2 - 4 \cdot Src)} + Dst \cdot Dst\Big] \cdot (2 - 4 \cdot Src) \cdot a + Dst \cdot (1 - a) }

If we include the last term into the original formula, we can still do it in 2 passes. We need to be careful to clamp the alpha value with max(0.001, a) because we’re now potentially dividing by 0. The final formula is

{ K_1 = \frac{4 \cdot Src - 1} {2 - 4 \cdot Src} }

{ K_2 = \frac{1 - a} {(2 - 4 \cdot Src) \cdot a} }

{ \Big[Dst \cdot (K_1 + K_2) + Dst \cdot Dst \Big] \cdot (2 - 4 \cdot Src) \cdot a }

J) Soft Light

Formula 1 – (1 – DstColor) · (1 – (SrcColor – 0.5)), if SrcColor > 0.5
DstColor · (SrcColor + 0.5), if SrcColor <= 0.5
Shader Pass 1
Blend Pass 1 SrcColor · DstColor + SrcColor · DstColor
Shader Pass 2
Blend Pass 2 SrcColor · DstColor + SrcColor * Zero

soft-light

For the Soft Light we apply a very similar reasoning to Overlay, which in the end leads us to Pegtop’s formula. Both are different from Photoshop’s version in that they don’t have discontinuities. This one also has a darker fringe when alpha blending.

K) Hard Light

Formula 1 – (1 – DstColor) · (1 – 2 · (SrcColor – 0.5)), if SrcColor> 0.5
DstColor · (2 · SrcColor), if SrcColor <= 0.5
Shader Pass 1
Blend Pass 1 SrcColor · One + DstColor · One
Shader Pass 2
Blend Pass 2 SrcColor · DstColor + SrcColor * Zero

hard-light

Hard Light has a very delicate hack that allows it to work and blend with alpha. In the first pass we divide by some magic number, only to multiply it back in the second pass! That’s because when alpha is 0 it needs to result in DstColor, but it was resulting in black.

L) Vivid Light

Formula 1 – (1 – DstColor) / (2 · (SrcColor – 0.5)), if SrcColor > 0.5
DstColor / (1 – 2 · SrcColor), if SrcColor <= 0.5
Shader Pass 1
Blend Pass 1 SrcColor · DstColor + SrcColor · Zero
Shader Pass 2
Blend Pass 2 SrcColor · One+ SrcColor · OneMinusSrcColor

vivid-light

M) Linear Light

Formula DstColor + 2 · (SrcColor – 0.5), if SrcColor > 0.5
DstColor + 2 · SrcColor – 1, if SrcColor <= 0.5
Shader Output
Blend Unit  SrcColor · One + DstColor · One

linear-light

The Rendering of Castlevania: Lords of Shadow 2

Castlevania Lords of Shadow 2 was released in 2014, a sequel that builds on top of Lords of Shadow, its first installment, which uses a similar engine. I hold these games dear and, being Spanish myself, I’m very proud of the work MercurySteam, a team from Madrid, did on all three modern reinterpretations of the Castlevania series (Lords of Shadow, Mirror of Fate and Lords of Shadow 2). Out of curiosity and pure fandom for the game I decided to peek into the Mercury Engine. Despite the first Lords of Shadow being, without shadow of a doubt (no pun intended), the best and most enjoyable of the new Castlevanias, out of justice for their hard work I decided to analyze a frame from their latest and most polished version of the engine. Despite being a recent game, it uses DX9 as graphics backend. Many popular tools like RenderDoc or the newest tools by Nvidia and AMD don’t support DX9, so I used Intel Graphics Analyzer to capture and analyze all the images and code from this post. While having a bit of graphics parlance, I’ve tried to include as many images as possible, with occasional code and in-depth explanations.

Analyzing a Frame

This is the frame we’re going to be looking at. It’s the beginning scene of Lords of Shadow 2, Dracula has just awakened, enemies are knocking at his door and he is not in the best mood.

CLOS2 Castle Final Frame

Depth Pre-pass

LoS2 appears to do what is called a depth pre-pass. What it means is you send the geometry once through the pipeline with very simple shaders, and pre-emptively populate the depth buffer. This is useful for the next pass (Gbuffer), as it attempts to avoid overdraw, so pixels with a depth value higher than the one already in the buffer (essentially, pixels that are behind) get discarded before they run the pixel shader, therefore minimizing pixel shader runs at the cost of extra geometry processing. Alpha tested geometry, like hair and a rug with holes, are also included in the pre-pass. LoS2 uses both the standard depth buffer and a depth-as-color buffer to be able to sample the depth buffer as a texture in a later stage.

The game also takes the opportunity to fill in the stencil buffer, an auxiliary buffer that is part of the depth buffer, and generally contains masks for pixel selection. I haven’t thoroughly investigated why precisely all these elements are marked, but for instance was presents higher subsurface scattering and hair and skin have its own shading, independent of the main lighting pass, which stencil allows to ignore.

  • Dracula: 85
  • Hair, skin and leather: 86
  • Window glass/blood/dripping wax: 133
  • Candles: 21

The first image below shows what the overdraw is like for this scene. A depth pre-pass helps if you have a lot of overdraw. The second image is the stencil buffer.

Arrow
Arrow
Depth Prepass Overdraw
Slider

GBuffer Pass

LoS2 uses a deferred pipeline, fully populating 4 G-Buffers. 4 buffers is quite big for a game that was released on Xbox360 and PS3, other games get away with 3 by using several optimizations.

Normals (in World Space):

normal.r normal.g normal.b sss

The normal buffer is populated with the three components of the world space normal and a subsurface scattering term for hair and wax (interestingly not skin). Opaque objects only transform their normal from tangent space to world space, but hair uses some form of normal shifting to give it anisotropic properties.

Arrow
Arrow
Normals RGB (World)
Slider

Albedo:

albedo.r albedo.g albedo.b alpha * AOLevels

The albedo buffer stores all three albedo components plus an ambient occlusion term that is stored per vertex in the alpha channel of the vertex color and is modulated by an AO constant (which I presume depends on the general lighting of the scene).

Arrow
Arrow
Albedo RGB
Slider

Specular:

specular.r specular.g specular.b Fresnel multiplier

The specular buffer stores the specular color multiplied by a fresnel term that depends on the view and normal vectors. Although LoS2 does not use physically-based rendering, it includes a Fresnel term probably inspired in part by the Schlick approximation to try and brighten things up at glancing angles. It is not strictly correct, as it is done independently of the real-time lights. The Fresnel factor is also stored in the w component.

Arrow
Arrow
Specular RGB
Slider

Ambient Lighting:

ambient.r ambient.g ambient.b AO constant

The Ambient buffer stores colored ambient lighting and occlusion. It takes the input vertex color and multiplies it by a constant AO factor (different from the AO factor for the albedo). Static geometry uses lightmaps, as is standard practice in many games, but animated geometry using normal maps uses a different technique. My first hypothesis without looking at the code was that they would be using spherical harmonics, but after looking at the assembly I think it’s based on a technique described by Valve in 2006 for Half-Life 2.

The technique works like this (look at the assembly to follow what I say): first the normal is calculated in world space, and the positive and negative components separated. Then those components are squared, and multiplied by two different matrices contained in PrecalcAOColors, which is passed as a constant. These matrices are described in the Valve paper as an Ambient Cube, containing six colors. It is a technique that was developed around the time that spherical harmonics were developed, but is more compact as it only uses 6 colors (9 are needed for the most basic spherical harmonics) and is faster to evaluate.

After that both contributions are added and multiplied by a constant, and then added back again to the vertex colors. The last component is a constant coming from either the lightmap for static geometry, or the PrecalcAOLevels (light probes) for dynamic geometry.

Arrow
Arrow
Ambient RGB
Slider

Continue reading

Replay system using Unity

Unity is an incredible tool for making quality games at a blazing fast pace. However, like all closed systems there are some limitations to how you can extend the engine and one such limitation is developing a good replay system for a game. I will talk about two possible approaches and how to solve other issues along the way. The system was devised for a Match 3 prototype, but can be applied to any project. There are commercial solutions available, but this post is intended for coders.

If the game has been designed with a deterministic outcome in mind, the most essential parts of recording are input events, delta times (optional) and random seeds. What this means is the only input available to the game will be the players’ actions, the rest should be simulated properly to arrive at the same outcome. Since storing random seeds and loading as appropriate is more or less straightforward, we will focus on the input.

1) The first issue is how to capture and replay input in the least disturbing way possible. If you have used Unity’s Input class, Input.GetMouseButton(i) should look familiar. The replay system was added after developing the main mechanics, and we didn’t want to go back and rewrite how the game worked. Such a system should ideally work for future or existing games, and Unity already provides a nice interface that other programmers use. Furthermore, plugins use this interface, and not sticking to it can severely limit your ability to record games.

The solution we arrived at was shadowing Unity’s Input class by creating a new class with the same name and accessing it through the UnityEngine namespace inside of the new class. This allows for conditional routing of Unity’s input, therefore passing recorded values into the Input.GetMouseButtonX functions, and essentially ‘tricking’ the game into thinking it is playing real player input. You can do the same with keys.

There are many functions and properties to override, it can take time and care to get it all working properly. Once you have this new layer you can create a RecordManager class and start creating methods that connect with the new Input class.

2) The second issue is trickier to get properly working, due to common misconceptions (myself included) about how Unity’s Update loops work. Unity has two different Update loops that serve different purposes, Update and FixedUpdate. Update runs at every frame, whereas FixedUpdate updates at a fixed, specified time interval. FixedUpdate has absolutely nothing to do with Update. No rule says that for every Update there should be a FixedUpdate, or that there should be no more than one for every Update.

Let’s explain it with two use cases. For both, the FixedUpdate interval is 0,017 s (~60 fps).

a) Update runs at 60 fps (same as FixedUpdate). The order of updates would be:

b) Update runs faster (120 fps). I have chosen this number because it is exactly double that of FixedUpdate. In this case, the order of updates would be as follows:
There is one FixedUpdate every two Update.

c) Update runs slower (30 fps). Same rule as above, but 30 = 60/2

Since FixedUpdate can’t keep up with Update, it updates twice to compensate.

This brings up the following question: where should I record input events, and where should I replay them? How can I replay something I recorded on one computer on another, and have the same output?

The answer to the first question is record in Update. It is guaranteed to run in every Unity tick, and doing so in FixedUpdate will cause you to miss input events and mess up your recording. The answer to the second question is a little more open, and depends on how you recorded your data.

One approach is to record the deltaTime in Update for every Update, and shadow Unity’s Time class the same way we did with Input to be able to read a recorded Time.deltaTime property wherever it’s used. This has two possible issues, namely precision (of the deltaTime) and storage.

The second approach is to save events and link them to their corresponding FixedUpdate tick, that way you can associate many events to a single tick (if Update goes too fast) or none (if Update goes too slow). With this approach you can only execute your code in FixedUpdate, and execute as many times as recorded Updates there are. It’s also important to save the average Update time of the original recording and set it as the FixedUpdate interval. The simulation will not be 100% accurate in that Update times won’t fluctuate as they did in the original recording session, but it is guaranteed to execute the same code.

ScriptExecutionOrder

There is one last setting that’s needed to properly record events, which is set the RecordManager to record all input at the beginning of every frame. Unity has a Script Execution Order option under Project Settings where you can set the RecordManager to run before any other script. That way recording and replaying are guaranteed to run

Java and Vector Math

Some time ago, I had to develop a 3D vector utility class for Java. Because the Android platform only uses Java, this is a must if you’re developing for it (unless you’re directly using the NDK or some middleware such as Unity. I’ll get back to this later).

Java, like all programming languages, has its virtues and weaknesses. I found trying to develop a solid Vector3 class to be one such weakness, because Java lacks two main features that I consider core to what I was trying to achieve: stack-allocated objects and operator overloading. These missing features make operating with vectors in Java an annoyance beyond measure.

As much as some people seem to dislike operator overloading, vector/matrix math is one domain where I consider they excel, and the lack of it is going to force me to always go through functions for even the simplest of operations, such as adding/subtracting two vectors or multiplying/dividing by a scalar.

Compare the following lines of code:

Doesn’t seem too bad, does it? Let’s try something different, like obtaining a direction from two points, normalizing, scaling by a factor, and adding it to a point (a relatively frequent operation)

It’s either that or separating into several lines so it becomes clearer and a bit more readable. This is clearly an undesirable way of working with vectors, but the only at our disposal when using Java.

There’s another caveat, though, one that is implicit in the way we have used the equal operator until now. We have been assigning the Vector3 by reference all this time, invalidating the erroneous assumption that we get a new Vector3 out of the operation. What we want is a copy of the resulting Vector3, which means create a new Vector3 using the new operator, and copy the values into the new Vector3. Therefore, line [2] would become something along the lines of:

or

In any case it is a very confusing way of working with vectors.

There is yet another annoyance to be wary of when working with instantiations on the heap in a garbage collected language such as Java which is, precisely, the dreaded Garbage Collector. Vector3 operations typically go inside long loops where interesting calculations take place for many objects in the virtual world, and creating all those new objects in the heap is asking for trouble when the GC comes to inspect your pile of vector rubbish and tries to clean up the mess. This is due to the fact that there are no stack-allocated objects in Java, leaving us with one option – creating temporary Vector3’s and reusing them. This has its fair share of problems too – mainly readability, the ability to use several temporary vectors for intermediate calculations, and having to pass the Vector3 reference as a parameter instead of returning it as a normal function return value. Let’s go back to our example.

Definitely not ideal. In contrast, and for all its similarity with Java, C# has both these features, which makes it a language of choice for these kinds of applications, and I wonder if it was one among the multiple reasons why Unity chose it as their main development language.