SMAA Implementation Example with Vireo RHI

SMAA / Vireo RHI / C++ / Vulkan / DirectX 12 / Slang Shader Langage / Compute Shaders / Deferred Rendering Henri Michelon

Context

This article details the SMAA implementation in the Vireo RHI Samples examples of the Vireo Rendering Hardware Interface project.

What is SMAA?

Anti-aliasing is one of the fundamental challenges of real-time rendering. SMAA (Subpixel Morphological Anti-Aliasing) offers a post-process approach that operates on the final rendered result: the image is processed after rendering, without using geometry (as MSAA does).

The principle is morphological: the shape of edges is detected to derive adaptive blending weights, which provides quality close to 4× MSAA at a much lower cost.

In this example, SMAA is integrated into the PostProcessing module alongside FXAA and TAA. Its implementation is broken down into three compute shader passes, implemented in C++23 and Slang. Compared to a full-screen quad approach in the graphics pipeline, the compute implementation avoids beginRendering/endRendering passes and intermediate render targets: the outputs of each pass are read/write images (createReadWriteImage) accessed directly via READWRITE_IMAGE descriptors.

The SMAA Pipeline

SMAA requires three passes to reach the final result:

Pass 1 Edge Detection smaaEdgeImage · R16G16_SFLOAT · READWRITE_IMAGE
Pass 2 Blending Weights smaaBlendImage · R16G16_SFLOAT · READWRITE_IMAGE
Pass 3 Neighborhood Blending smaaColorImage · final format · READWRITE_IMAGE

Each pass is a compute shader dispatching a grid of work groups covering the entire image (groups of 8×8 threads). The shaders are written in Slang, a modern shading language compiled to SPIR-V (Vulkan) or DXIL (DirectX 12).

Unlike the full-screen quad approach in the graphics pipeline, no vertex shader or vertex buffer is required. Intermediate buffers are read/write images (createReadWriteImage) rather than render targets, which eliminates RENDER_TARGET_COLOR transitions and beginRendering/endRendering calls.

Pass 1 — Edge Detection

This pass receives the color image output from the rendering pipeline (inputImage) and produces an image encoding the intensity of horizontal and vertical edges for each pixel. The compute shader is dispatched with groups of 8×8 threads, each thread processing one pixel.

#include "postprocess.inc.slang"

ConstantBuffer<Params> params       : register(b0);
Texture2D        inputImage         : register(t1);
RWTexture2D<float4> edgeOutput      : register(u2);
SamplerState     sampler            : register(SAMPLER_LINEAR_EDGE, space1);

[numthreads(8, 8, 1)]
void computeMain(uint3 dispatchId : SV_DispatchThreadID) {
    float2 uv = (dispatchId.xy + 0.5) / params.imageSize;

    // Coefficients from the ITU-R BT.601 perceptual luminance formula
    float3 lumaWeight = float3(0.299, 0.587, 0.114);

    // Sampling of the central pixel and its 4 direct neighbors
    float lumaC = dot(inputImage.Sample(sampler, uv).rgb,                            lumaWeight);
    float lumaN = dot(inputImage.Sample(sampler, uv + float2( 0, -1) / params.imageSize).rgb, lumaWeight);
    float lumaS = dot(inputImage.Sample(sampler, uv + float2( 0,  1) / params.imageSize).rgb, lumaWeight);
    float lumaW = dot(inputImage.Sample(sampler, uv + float2(-1,  0) / params.imageSize).rgb, lumaWeight);
    float lumaE = dot(inputImage.Sample(sampler, uv + float2( 1,  0) / params.imageSize).rgb, lumaWeight);

    // Horizontal gradient: left-center difference + center-right difference
    float edgeH = abs(lumaW - lumaC) + abs(lumaC - lumaE);
    // Vertical gradient:   top-center difference  + center-bottom difference
    float edgeV = abs(lumaN - lumaC) + abs(lumaC - lumaS);

    // Direct write: R = horizontal edge, G = vertical edge
    edgeOutput[dispatchId.xy] = float4(edgeH, edgeV, 0, 0);
}

Compared to a fragment shader, the compute shader receives its pixel position via SV_DispatchThreadID and writes directly into RWTexture2D via edgeOutput[dispatchId.xy], without going through a render target or a color attachment. UV coordinates are reconstructed by dividing by params.imageSize.

Working in luminance rather than RGB offers two advantages: fewer samples are needed (a scalar instead of a 3-float vector) and better correlation with human perception. The coefficients (0.299, 0.587, 0.114) come from the BT.601 standard, which is standard in computer graphics.

The sampler used here is SAMPLER_LINEAR_EDGE (addressing mode clamp-to-edge, bilinear filtering). This ensures that pixels at image borders do not read out-of-bounds texels.

The output image smaaEdgeImage is allocated as R16G16_SFLOAT, i.e. two 16-bit channels. The Red channel is used for the horizontal gradient and the Green channel for the vertical gradient.

Edge Detection example (Lysa Engine)
Edge Detection example (Lysa Engine)

Pass 2 — Blending Weight Calculation

This pass consumes the image produced in pass 1 and calculates a scalar blending weight for each pixel. It reads edgeImage via a Texture2D (read-only access) and writes the result into blendOutput, a RWTexture2D.

#include "postprocess.inc.slang"

ConstantBuffer<Params> params     : register(b0);
Texture2D        edgeImage        : register(t1);
RWTexture2D<float4> blendOutput  : register(u2);

[numthreads(8, 8, 1)]
void computeMain(uint3 dispatchId : SV_DispatchThreadID) {
    // Read H and V edges from pass 1 (integer coordinates)
    float2 edge = edgeImage[dispatchId.xy].rg;

    // The weight is the maximum of the two directions, clamped to [0, 1]
    float weight = saturate(max(edge.r, edge.g));

    blendOutput[dispatchId.xy] = float4(weight, weight, weight, 0);
}

max(edgeH, edgeV) selects the dominant direction of the edge. A pixel located at the intersection of a strong edge in both directions will receive a maximum weight, which is the desired behavior since this pixel will benefit more from smoothing.

The saturate(x) function is the GPU equivalent of the C++ function clamp(x, 0, 1).

Unlike pass 1, no sampler is used here: edgeImage[dispatchId.xy] performs a direct access by integer coordinates (equivalent to nearest-neighbor without filtering), which is exact since edge data is read pixel by pixel. Out-of-bounds accesses return zero thanks to the default behavior of Texture2D in border mode.

Blending Weight example (Lysa Engine)
Blending Weight example de mélange (Lysa Engine)

Pass 3 — Neighborhood Blending

The final pass combines the original image with its immediate neighbors, proportionally to the weights calculated in pass 2.

#include "postprocess.inc.slang"

ConstantBuffer<Params> params      : register(b0);
Texture2D        inputImage       : register(t1);
Texture2D        blendBuffer      : register(t2);  // Output from pass 2
SamplerState     sampler          : register(SAMPLER_NEAREST_BORDER, space1);

float4 fragmentMain(VertexOutput input) : SV_TARGET {
    float4 color = inputImage.Sample(sampler, input.uv);      // Central pixel
    float2 blend = blendBuffer.Sample(sampler, input.uv).rg;  // H and V weights

    float4 n = inputImage.Sample(sampler, input.uv + float2( 0, -1) / params.imageSize);
    float4 e = inputImage.Sample(sampler, input.uv + float2( 1,  0) / params.imageSize);

    // Vertical interpolation: blend with North neighbor
    float4 blended = lerp(color, n, blend.r);
    // Horizontal interpolation: blend with East neighbor
    blended = lerp(blended, e, blend.g);

    return blended;
}

This pass is the only one to use the smaaDescriptorLayout layout with three bindings: the parameters buffer (b0), the original color image (t1), and the weight buffer (t2).

The first interpolation (lerp) blends the current pixel with its North neighbor proportionally to blend.r (horizontal edge detected = top/bottom transition). The second blends the result with the East neighbor according to blend.g.

Where there is no edge (blend ≈ 0), the pixel is returned unchanged. Where a strong edge is detected (blend ≈ 1), the pixel is heavily interpolated with its neighbor in the direction perpendicular to the edge.

Integration into the PostProcessing Pipeline

The samples.common.postprocessing module (PostProcessing.ixx and PostProcessing.cpp) manages the entire lifecycle of the post-processing passes.

Pipeline initialization (onInit)

// Creation of the three SMAA graphics pipelines
pipelineConfig.colorRenderFormats.push_back(vireo::ImageFormat::R16G16_SFLOAT);
pipelineConfig.resources = resources;  // standard layout (2 bindings)

pipelineConfig.fragmentShader = vireo->createShaderModule("shaders/smaa_edge_detect.frag");
smaaEdgePipeline = vireo->createGraphicPipeline(pipelineConfig);

pipelineConfig.fragmentShader = vireo->createShaderModule("shaders/smaa_blend_weigth.frag");
smaaBlendWeightPipeline = vireo->createGraphicPipeline(pipelineConfig);

// Pass 3 has a different format and layout
pipelineConfig.colorRenderFormats[0] = renderFormat;   // final image format
pipelineConfig.resources = smaaResources;               // layout with 3 bindings
pipelineConfig.fragmentShader = vireo->createShaderModule("shaders/smaa_neighborhood_blend.frag");
smaaBlendPipeline = vireo->createGraphicPipeline(pipelineConfig);

Two different descriptor layouts are created: descriptorLayout (2 bindings: params + input) for passes 1 and 2, and smaaDescriptorLayout (3 bindings: params + input + blendBuffer) for pass 3.

Render loop (onRender)

The SMAA pass in onRender is integrated alongside the other post-processing passes:

// === Pass 1: edge detection ===
cmdList->barrier(colorInput,          // source image → shader read
    vireo::ResourceState::RENDER_TARGET_COLOR,
    vireo::ResourceState::SHADER_READ);
cmdList->barrier(frame.smaaEdgeBuffer,  // edge buffer → render target
    vireo::ResourceState::UNDEFINED,
    vireo::ResourceState::RENDER_TARGET_COLOR);
frame.smaaEdgeDescriptorSet->update(BINDING_INPUT, colorInput->getImage());
// ... beginRendering → bindPipeline → draw(3) → endRendering

// === Pass 2: blending weights ===
cmdList->barrier(frame.smaaEdgeBuffer,  // edge buffer → shader read
    vireo::ResourceState::RENDER_TARGET_COLOR,
    vireo::ResourceState::SHADER_READ);
cmdList->barrier(frame.smaaBlendBuffer, // blend buffer → render target
    vireo::ResourceState::UNDEFINED,
    vireo::ResourceState::RENDER_TARGET_COLOR);
frame.smaaBlendWeightDescriptorSet->update(BINDING_INPUT, frame.smaaEdgeBuffer->getImage());
// ... beginRendering → bindPipeline → draw(3) → endRendering

// === Pass 3: neighborhood blending ===
cmdList->barrier(frame.smaaBlendBuffer, // blend buffer → shader read
    vireo::ResourceState::RENDER_TARGET_COLOR,
    vireo::ResourceState::SHADER_READ);
frame.smaaBlendDescriptorSet->update(BINDING_INPUT,      colorInput->getImage());
frame.smaaBlendDescriptorSet->update(BINDING_SMAA_INPUT, frame.smaaBlendBuffer->getImage());
// ... beginRendering → bindPipeline → draw(3) → endRendering

The pattern is identical for each pass: barrier on the source (transition to SHADER_READ), barrier on the destination (transition to RENDER_TARGET_COLOR), descriptor set update, then draw execution. Barriers ensure correct GPU synchronization between passes (making sure data is fully written into buffers before being read).

SMAA vs FXAA vs MSAA — Comparison

The Vireo RHI examples in this project integrate three anti-aliasing techniques:

FXAA (Fast Approximate Anti-Aliasing)

FXAA detects edges via local contrast and applies a directional blur in a single pass. Extremely fast, but may make the image slightly blurry, especially on textures.

SMAA (Subpixel Morphological AA)

SMAA analyzes edge morphology to produce more accurate blending weights. Quality is superior to FXAA, with less sharpness loss. The cost is higher (3 passes) but remains very acceptable on modern hardware.

TAA (Temporal Anti-Aliasing)

TAA exploits the history of previous frames to accumulate more information. Excellent quality on static geometry, but requires a velocity buffer and ghosting management for fast-moving objects.

The TAA + SMAA combination (supported in this example) is particularly interesting: TAA stabilizes the image temporally while SMAA refines spatial edges, each compensating for the other's weaknesses.

Conclusion

Several improvement directions are possible to get closer to "full" SMAA:

  • Use an edge search LUT in pass 2 to compute precise directional weights based on edge length
  • Implement stencil masking to process only pixels near an edge, reducing GPU load
  • Add the SMAA T2× variant (with temporal jitter) for even finer subpixel anti-aliasing