Modern GPUs are throughput machines. One of the ways they achieve significant gains over CPUs is by batching huge amounts of similar work together. For example, doing shading calculations for every pixel in a 1920x1080 frame is effectively executing the same function 1920*1080 times with slightly different inputs. GPUs take advantage of this workload similarity by using a SIMD architecture - Single Instruction, Multiple Data.
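To make that concrete, here is a minimal sketch of a fragment shader (the resolution constant is just for illustration): the GPU runs this exact same code once per pixel, and the only thing that differs between invocations is the built-in input gl_FragCoord.

#version 450
layout(location = 0) out vec4 outColor;

void main() {
    // Every pixel runs these same instructions; only gl_FragCoord differs.
    vec2 uv = gl_FragCoord.xy / vec2(1920.0, 1080.0);
    outColor = vec4(uv, 0.0, 1.0); // a simple gradient across the frame
}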

At the lowest level of GPU programming there are individual threads. These threads may represent individual vertices running a vertex shader, pixels (or "fragments") running a pixel shader, or compute threads when dealing with compute shaders. Threads are grouped into a hierarchy with multiple layers; the complete picture of that hierarchy is frustratingly complicated and difficult to track down online. For our purposes it's sufficient to know that GPUs batch work into groups of 32 or 64 threads that execute in lockstep. These groups go by many different names: "warp", "wavefront", "thread-group", "SIMD-group". On Nvidia hardware they are called warps, and they always contain 32 threads. Multiple warps will be active at the same time, but communication / synchronization between them is either complicated or impossible.
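This grouping is visible from inside a shader through the subgroup extensions. The following is only a sketch, assuming GL_KHR_shader_subgroup_basic is supported (Vulkan, or GL with the extension): gl_SubgroupSize reports the warp/wavefront width, and gl_SubgroupInvocationID is a thread's lane index within it.

#version 450
#extension GL_KHR_shader_subgroup_basic : require
layout(location = 0) out vec4 outColor;

void main() {
    // gl_SubgroupSize is 32 on Nvidia hardware; gl_SubgroupInvocationID is
    // this thread's position (0 .. gl_SubgroupSize-1) within its warp.
    float lane = float(gl_SubgroupInvocationID) / float(gl_SubgroupSize - 1u);
    outColor = vec4(lane, lane, lane, 1.0);
}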

32 pixels within a warp will execute the fragment shader on the right. Every single one of the pixels within the warp will execute the same line of code at the same time. Other warps will also be executing the code while the frame is rendering, but the relative order that this happens in is undefined.

It is useful to have a ballpark understanding of how many threads are active at once. On my laptop's GTX 1070 there are on the order of 1,024 total warps, which means at most 1024*32 = 32,768 pixels can be actively shaded at once. The actual number will be lower than this, because warps have to stall and wait for things like texture() reads. Notice that 32,768 is significantly lower than the number of pixels in even a small 1280x720 frame (1280*720 = 921,600 pixels, roughly 28 times more than can be in flight). This means that portions of the frame are rendered sequentially as warps finish and move on to subsequent chunks of pixels.

A 512x512 frame.

Warp IDs rendering a post-processing shader for the frame. Notice the 4x8-pixel structures; these are individual warps. A diagonal line is visible due to the rasterization of a fullscreen quad.

The hypothetical maximum amount of the frame (in red) that could be active at once if all warps were active at the same time. This will not be achieved in reality. Furthermore, the way that pixels are grouped together and processed will be more chunky, like on the left.

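For reference, an image like the warp-ID visualization above can be produced with Nvidia's thread-group extension. This is only a sketch, assuming GL_NV_shader_thread_group is available; the hash constants are arbitrary and just spread warp IDs into distinct colors.

#version 450
#extension GL_NV_shader_thread_group : require
layout(location = 0) out vec4 outColor;

void main() {
    // Combine the warp's ID within its SM with the SM index, then hash the
    // result into a color so each 32-thread warp shows up as a solid block.
    uint id = gl_SMIDNV * gl_WarpsPerSMNV + gl_WarpIDNV;
    vec3 color = fract(float(id) * vec3(0.1031, 0.2973, 0.5197));
    outColor = vec4(color, 1.0);
}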

I make this point about how the frame is only ever partially active because it means that, at some level, chunks of the frame are rendered sequentially, one after another. This presents an opportunity for optimization: if certain parts of the frame are very cheap and can be computed quickly, the warps computing them can be freed up to move on to doing worthwhile work. This is exactly the case in frames with large portions of sky.

A Simple Example

Sky pixels typically involve completely separate, usually cheaper, code-paths than the ones that render shaded terrain.

An overlay mask showing which pixels render sky, and which pixels render terrain.

Consider these two examples of a fragment shader which render the same image:

// Naive version: every thread computes both results and blends between them.
void main() {
    ...
    vec3 skyColor = ComputeSkyColor(...); // Cheap
    vec3 terrainColor = ComputeShadedTerrain(...); // Expensive
    
    vec3 color = mix(terrainColor, skyColor, vec3(pixelIsSky));
    
    bufferColor = color;
}

// Branching version: each thread runs only the code path its pixel needs.
void main() {
    ...
    vec3 color;
    
    if (pixelIsSky) {
        color = ComputeSkyColor(...); // Cheap
    } else {
        color = ComputeShadedTerrain(...); // Expensive
    }
    
    bufferColor = color;
}

The naive version of the shader executes both functions on all threads, all the time, no matter what, whereas the optimized version picks whichever function the given thread actually needs and executes only that one. If we forget about warps for a second and pretend that all threads execute sequentially, the optimized version would save ~25% of the time spent on terrain shading, because ~25% of the pixels are sky pixels.

However, because threads execute in warps, there will be some portion of warps containing both terrain and sky; these warps will have to execute both code paths.

An overlay showing which warps contain only sky pixels (blue), only terrain pixels (green), or both (turquoise).

Wasted sky pixels inside of turquoise warps.

The sky pixels inside of turquoise warps will have their lanes masked out while the terrain code is running, and will effectively have to wait for the slowest code path in the warp to finish executing. This does not make a big difference in this example, however, because most of the sky pixels are inside blue warps. In other words, the conditional branch is very coherent.
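As a side note, an overlay like the blue/green/turquoise one above can be expressed with the subgroup vote operations. This is a sketch, assuming GL_KHR_shader_subgroup_vote is supported and that sky pixels can be identified from a depth buffer (the texture binding and the far-plane test are assumptions for illustration, not necessarily how the figures were made):

#version 450
#extension GL_KHR_shader_subgroup_vote : require
layout(binding = 0) uniform sampler2D depthTex; // hypothetical depth buffer for this sketch
layout(location = 0) out vec4 outColor;

void main() {
    // Assumption for this sketch: pixels at the far plane are sky.
    bool pixelIsSky = texelFetch(depthTex, ivec2(gl_FragCoord.xy), 0).r >= 1.0;

    if (subgroupAll(pixelIsSky)) {
        outColor = vec4(0.0, 0.0, 1.0, 1.0);  // whole warp is sky (blue)
    } else if (!subgroupAny(pixelIsSky)) {
        outColor = vec4(0.0, 1.0, 0.0, 1.0);  // whole warp is terrain (green)
    } else {
        outColor = vec4(0.0, 1.0, 1.0, 1.0);  // warp contains both (turquoise)
    }
}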