🎯 Why You Want One#
Frostbite introduced it at GDC 2017. UE5 ships one as RDG, the Render Dependency Graph. Every major renderer uses one — this series shows you why, walks you through building your own in C++, and maps every piece to what ships in production engines.
🔥 The Problem#
The pattern is always the same: manual resource management works at small scale and fails at compound scale. Not because engineers are sloppy — because no human tracks 25 lifetimes and 47 transitions in their head every sprint. You need a system that sees the whole frame at once.
💡 The Core Idea#
A frame graph is a directed acyclic graph (DAG) — each node is a render pass, each edge is a resource one pass hands to the next. Here’s what a typical deferred frame looks like:
[Interactive graph: a Prepass produces depth; a downstream pass consumes it and writes albedo · normals · depth.]
You don’t execute this graph directly. Every frame goes through three steps: first you declare all the passes and what they read/write, then the system compiles an optimized plan (ordering, memory, barriers), and finally it executes the result.
Let’s look at each step.
📋 The Declare Step#
Each frame starts on the CPU. You register passes, describe the resources they need, and declare who reads or writes what. No GPU work happens yet — you’re building a description of the frame.
- `addPass(setup, execute)`: register a pass with a setup lambda (runs now, declares resources and dependencies) and an execute lambda (runs later, records GPU work)
- `create({1920, 1080, RGBA8})`: describe a transient texture by its dimensions and format
- `read(h)` / `write(h)`: declare which resource handles the pass consumes or produces
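To make the shape of that API concrete, here is a minimal sketch of the declare step. Everything in it — `FrameGraph`, `Builder`, `PassNode`, the `Format` enum — is illustrative scaffolding for this series, not any engine's actual types:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

enum class Format { RGBA8, D32F };
struct TextureDesc   { uint32_t width = 0, height = 0; Format format = Format::RGBA8; };
struct ResourceHandle { uint32_t index = 0; };

struct PassNode {
    std::string name;
    std::vector<ResourceHandle> reads, writes;
    std::function<void()> execute;      // deferred: records GPU commands later
};

class FrameGraph {
public:
    struct Builder {
        FrameGraph& fg;
        PassNode&   pass;
        ResourceHandle create(const TextureDesc& d) {          // transient texture
            fg.descs_.push_back(d);
            return { uint32_t(fg.descs_.size() - 1) };
        }
        ResourceHandle read (ResourceHandle h) { pass.reads.push_back(h);  return h; }
        ResourceHandle write(ResourceHandle h) { pass.writes.push_back(h); return h; }
    };

    template <typename Setup, typename Execute>
    void addPass(std::string name, Setup&& setup, Execute&& execute) {
        passes_.push_back({ std::move(name), {}, {}, std::forward<Execute>(execute) });
        Builder b{ *this, passes_.back() };
        setup(b);                                   // runs immediately, CPU only
    }

private:
    std::vector<PassNode>    passes_;
    std::vector<TextureDesc> descs_;
};

// Usage: declare a depth prepass. Nothing touches the GPU here — the setup
// lambda only records intent; the execute lambda is stored for later.
void declarePrepass(FrameGraph& fg) {
    ResourceHandle depth;
    fg.addPass("DepthPrepass",
        [&](FrameGraph::Builder& b) {
            depth = b.write(b.create({1920, 1080, Format::D32F}));
        },
        [] { /* later: bind the real texture, draw occluders */ });
}
```

The two-lambda split is the key design choice: setup runs during declare and sees only handles, while execute runs during the execute step, after compile has bound every handle to physical memory.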
📦 Transient vs. imported#
When you declare a resource, the graph needs to know one thing: does it live inside this frame, or does it come from outside?
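One way to record that distinction is a single registry entry per resource, where an imported resource carries a pointer the graph does not own. A sketch — `GpuTexture` stands in for whatever the backend hands you, and both function names are made up for illustration:

```cpp
#include <cstdint>
#include <vector>

struct GpuTexture;                        // opaque backend object (ID3D12Resource*, VkImage, ...)
struct TextureDesc { uint32_t w = 0, h = 0; };  // trimmed for the sketch

// One entry per declared resource. A transient resource carries only a
// creation recipe; an imported one carries a pointer the graph doesn't own.
struct VirtualResource {
    TextureDesc desc{};
    GpuTexture* imported = nullptr;
    bool isTransient() const { return imported == nullptr; }
};

struct ResourceRegistry {
    std::vector<VirtualResource> resources;

    uint32_t createTransient(TextureDesc d) {    // graph-owned: may be aliased,
        resources.push_back({d, nullptr});       // pooled, or freed at frame end
        return uint32_t(resources.size() - 1);
    }
    uint32_t importExternal(GpuTexture* tex) {   // caller-owned (e.g. the
        resources.push_back({{}, tex});          // backbuffer): never aliased
        return uint32_t(resources.size() - 1);   // or freed by the graph
    }
};
```

The flag matters downstream: only transient resources participate in aliasing, and imported ones must be returned to their expected state when the frame ends.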
⚙️ The Compile Step#
The declared DAG goes in; an optimized execution plan comes out — all on the CPU, in microseconds.
Sorting and culling#
Sorting is a topological sort over the dependency edges, producing a linear order that respects every read-before-write constraint.
Culling walks backward from the final outputs and removes any pass whose results are never read. Dead-code elimination for GPU work — entire passes vanish without a feature flag.
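Both steps are a few dozen lines over adjacency lists. A sketch, assuming passes are numbered and edges point from producer to consumer:

```cpp
#include <cstdint>
#include <vector>

struct Graph {
    std::vector<std::vector<uint32_t>> edges;  // i -> passes that consume i's outputs
    std::vector<bool> writesFinalOutput;       // e.g. touches the backbuffer
};

// Kahn's algorithm: repeatedly emit a pass with no unsatisfied dependencies.
std::vector<uint32_t> topoSort(const Graph& g) {
    uint32_t n = uint32_t(g.edges.size());
    std::vector<uint32_t> indegree(n, 0), order, ready;
    for (auto& consumers : g.edges)
        for (uint32_t c : consumers) indegree[c]++;
    for (uint32_t i = 0; i < n; ++i)
        if (indegree[i] == 0) ready.push_back(i);
    while (!ready.empty()) {
        uint32_t p = ready.back(); ready.pop_back();
        order.push_back(p);
        for (uint32_t c : g.edges[p])
            if (--indegree[c] == 0) ready.push_back(c);
    }
    return order;   // order.size() < n would mean a cycle: a declaration bug
}

// Culling: flood backward from passes that contribute to the final image;
// anything unreached produced results nobody consumes.
std::vector<bool> cullDeadPasses(const Graph& g) {
    uint32_t n = uint32_t(g.edges.size());
    std::vector<std::vector<uint32_t>> producers(n);   // reversed edges
    for (uint32_t i = 0; i < n; ++i)
        for (uint32_t c : g.edges[i]) producers[c].push_back(i);
    std::vector<bool> alive(n, false);
    std::vector<uint32_t> stack;
    for (uint32_t i = 0; i < n; ++i)
        if (g.writesFinalOutput[i]) { alive[i] = true; stack.push_back(i); }
    while (!stack.empty()) {
        uint32_t p = stack.back(); stack.pop_back();
        for (uint32_t src : producers[p])
            if (!alive[src]) { alive[src] = true; stack.push_back(src); }
    }
    return alive;
}
```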
Allocation and aliasing#
The sorted order tells the compiler exactly when each resource is first written and last read — its lifetime. Two resources that are never alive at the same time can share the same physical memory.
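A common way to exploit that is a greedy first-fit over lifetimes. This sketch assumes each transient's `[firstUse, lastUse]` interval has already been extracted from the sorted order, and it ignores alignment and placed-resource details a real allocator must handle:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Lifetime { uint32_t firstUse, lastUse, sizeBytes; };
struct Block    { uint32_t sizeBytes, freeAfter; };   // freeAfter = occupant's lastUse

// Returns resource -> physical block index. Two resources that share a block
// occupy the same memory at different times in the frame.
std::vector<uint32_t> assignBlocks(const std::vector<Lifetime>& res) {
    // Process in order of first use so "already dead" is well defined.
    std::vector<uint32_t> idx(res.size());
    for (uint32_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(),
              [&](uint32_t a, uint32_t b) { return res[a].firstUse < res[b].firstUse; });

    std::vector<Block> blocks;
    std::vector<uint32_t> assignment(res.size());
    for (uint32_t i : idx) {
        bool placed = false;
        for (uint32_t b = 0; b < blocks.size(); ++b) {
            // Reuse a block only if its occupant died before we are born
            // and the block is big enough.
            if (blocks[b].freeAfter < res[i].firstUse &&
                blocks[b].sizeBytes >= res[i].sizeBytes) {
                blocks[b].freeAfter = res[i].lastUse;
                assignment[i] = b;
                placed = true;
                break;
            }
        }
        if (!placed) {
            blocks.push_back({res[i].sizeBytes, res[i].lastUse});
            assignment[i] = uint32_t(blocks.size() - 1);
        }
    }
    return assignment;
}
```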
Barriers#
The compiler knows each resource’s state at every point — render target, shader read, copy source — and inserts a barrier at every transition. Hand-written barriers are one of the most common sources of GPU bugs; the graph makes them automatic and correct by construction.
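The mechanism can be a single linear walk over the compiled order. A sketch with an illustrative state enum — real APIs have many more states and per-subresource granularity:

```cpp
#include <cstdint>
#include <vector>

enum class State { Undefined, RenderTarget, ShaderRead, CopySrc };

struct Use     { uint32_t resource; State needed; };
struct Pass    { std::vector<Use> uses; };
struct Barrier { uint32_t pass, resource; State before, after; };

std::vector<Barrier> insertBarriers(const std::vector<Pass>& order,
                                    uint32_t resourceCount) {
    std::vector<State> current(resourceCount, State::Undefined);
    std::vector<Barrier> out;
    for (uint32_t p = 0; p < order.size(); ++p) {
        for (const Use& u : order[p].uses) {
            if (current[u.resource] != u.needed) {
                // Emit a transition exactly where the state first mismatches.
                out.push_back({p, u.resource, current[u.resource], u.needed});
                current[u.resource] = u.needed;
            }
        }
    }
    return out;   // every transition recorded; none forgotten, none duplicated
}
```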
▶️ The Execute Step#
The plan is ready — now the GPU gets involved. Every decision has already been made during compile: pass order, memory layout, barriers, physical resource bindings. Execute just walks the plan.
- for each pass in compiled order: insert its barriers → call `execute()`
- release transients (or pool them)
- reset the frame allocator
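In code, the whole execute step can be this loop — `CompiledPass` and `CommandList` are stand-ins for whatever compile produced and for the backend's command buffer:

```cpp
#include <functional>
#include <vector>

struct CommandList { void barrier(int /*barrierTableIndex*/) {} };  // backend stand-in

struct CompiledPass {
    std::vector<int> barriersBefore;              // indices into the barrier table
    std::function<void(CommandList&)> execute;    // the lambda captured at declare
};

void executeFrame(const std::vector<CompiledPass>& plan, CommandList& cmd) {
    for (const CompiledPass& pass : plan) {
        for (int b : pass.barriersBefore)         // transitions first...
            cmd.barrier(b);
        pass.execute(cmd);                        // ...then the pass's GPU work
    }
    // Afterwards: release transients back to the pool and reset the
    // per-frame linear allocator, per the list above.
}
```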
🔄 Rebuild Strategies#
How often should the graph recompile? Three approaches, each a valid tradeoff: rebuild from scratch every frame (dynamic), compile once and reuse the plan (static), or cache the compiled plan and invalidate it when the declared structure changes (hybrid).
Most engines use dynamic or hybrid. The compile is so cheap that caching buys little — but some engines do it anyway to skip redundant barrier recalculation.
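One way to implement the hybrid strategy is to fingerprint the declared structure each frame and recompile only on change. A sketch — the FNV-1a hash is an arbitrary choice, and any stable structural fingerprint works:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct DeclaredPass { std::string name; std::vector<uint32_t> reads, writes; };

uint64_t structuralHash(const std::vector<DeclaredPass>& passes) {
    uint64_t h = 1469598103934665603ull;                   // FNV-1a offset basis
    auto mix = [&](uint64_t v) { h ^= v; h *= 1099511628211ull; };
    for (const DeclaredPass& p : passes) {
        mix(std::hash<std::string>{}(p.name));
        for (uint32_t r : p.reads)  mix(r);
        for (uint32_t w : p.writes) mix(w + 0x9e3779b9u);  // distinguish write role
    }
    return h;
}

// Each frame:
//   uint64_t h = structuralHash(declared);
//   if (h != cachedHash) { plan = compile(declared); cachedHash = h; }
//   executeFrame(plan, cmd);
```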
💰 The Payoff#
In UE5, an FRDGTexture marked as transient goes through the same aliasing pipeline we build in Part II.
🔬 Advanced Features#
The core graph handles scheduling, barriers, and aliasing — but the same DAG enables the compiler to go further. It can merge adjacent render passes to eliminate redundant state changes, schedule independent work across GPU queues, and split barrier transitions to hide cache-flush latency. Part II builds the core; Part III shows how production engines deploy all of these.
🔗 Pass Merging#
Unmerged, pass A stores its result to VRAM only for pass B to load it straight back:

Pass A
│ render
│ store → VRAM ✗
└ done

Pass B (Lighting)
│ load ← VRAM ✗
│ render
└ done

Merged, B reads A's output while it's still on-chip:

Merged A + B
│ render A
│ B reads in-place ✓
│ render B
└ store once → VRAM

Merging is legal only when three conditions hold:
① Same render target dimensions
② Second pass reads the first's output at the current pixel only (no arbitrary UV sampling)
③ No external dependencies forcing a render pass break
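The merge test itself can be a cheap predicate over adjacent passes in the sorted order. A sketch with illustrative field names — note that condition ② can't be inferred from `read()`/`write()` alone, so in this sketch it arrives as pass metadata:

```cpp
#include <cstdint>

struct RenderPassInfo {
    uint32_t width, height;
    bool readsPreviousAtCurrentPixelOnly;  // condition ②: no arbitrary UV sampling
    bool hasExternalBreak;                 // condition ③: e.g. a mid-frame readback
};

bool canMerge(const RenderPassInfo& a, const RenderPassInfo& b) {
    return a.width == b.width && a.height == b.height   // condition ①
        && b.readsPreviousAtCurrentPixelOnly            // condition ②
        && !a.hasExternalBreak && !b.hasExternalBreak;  // condition ③
}
```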
Fewer render pass boundaries means fewer state changes, less barrier overhead, and the driver gets a larger scope to schedule work internally. D3D12 Render Pass Tier 2 hardware can eliminate intermediate stores for merged passes entirely — the GPU keeps data on-chip between subpasses instead of round-tripping through VRAM. Console GPUs benefit similarly, where the driver can batch state setup across fused passes.
⚡ Async Compute#
Pass merging and barriers optimize work on a single GPU queue. But modern GPUs expose at least two: a graphics queue and a compute queue. If two passes have no dependency path between them in the DAG, the compiler can schedule them on different queues simultaneously.
🔍 Finding parallelism#
The compiler needs to answer one question for every pair of passes: can these run at the same time? Two passes can overlap only if neither depends on the other — directly or indirectly. A pass that writes the GBuffer can’t overlap with lighting (which reads it), but it can overlap with SSAO if they share no resources.
The algorithm is called reachability analysis — for each pass, the compiler figures out every other pass it can eventually reach by following edges forward through the DAG. If pass A can reach pass B (or B can reach A), they’re dependent. If neither can reach the other, they’re independent — safe to run on separate queues.
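A sketch of that analysis over producer→consumer adjacency lists, propagating in reverse topological order so each pass inherits its consumers' reachability in one sweep. The boolean matrix keeps it readable; production code would pack rows into bitsets so the propagation is a few word-wide ORs:

```cpp
#include <cstdint>
#include <vector>

// reach[i][j] == true when pass i can reach pass j along DAG edges.
std::vector<std::vector<bool>>
computeReachability(const std::vector<std::vector<uint32_t>>& edges,
                    const std::vector<uint32_t>& topoOrder) {
    uint32_t n = uint32_t(edges.size());
    std::vector<std::vector<bool>> reach(n, std::vector<bool>(n, false));
    for (auto it = topoOrder.rbegin(); it != topoOrder.rend(); ++it) {
        uint32_t p = *it;
        for (uint32_t c : edges[p]) {
            reach[p][c] = true;
            for (uint32_t j = 0; j < n; ++j)
                if (reach[c][j]) reach[p][j] = true;   // inherit transitively
        }
    }
    return reach;
}

// Two passes may run on separate queues iff neither reaches the other.
bool independent(const std::vector<std::vector<bool>>& reach,
                 uint32_t a, uint32_t b) {
    return !reach[a][b] && !reach[b][a];
}
```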
🚧 Minimizing fences#
Cross-queue work needs GPU fences — one queue signals, the other waits. Each fence costs ~5–15 µs of dead GPU time. Move SSAO, volumetrics, and particle sim to compute and you create six fences — up to 90 µs of idle that can erase the overlap gain. The compiler applies transitive reduction to collapse those down to the minimum set of synchronization points.
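Transitive reduction drops any edge already implied by a longer path, so each surviving cross-queue edge is one unavoidable fence. A sketch reusing the reachability matrix from above — on a DAG, an edge a→b is redundant exactly when some other direct successor of a also reaches b:

```cpp
#include <cstdint>
#include <vector>

std::vector<std::vector<uint32_t>>
transitiveReduction(const std::vector<std::vector<uint32_t>>& edges,
                    const std::vector<std::vector<bool>>& reach) {
    std::vector<std::vector<uint32_t>> reduced(edges.size());
    for (uint32_t a = 0; a < edges.size(); ++a) {
        for (uint32_t b : edges[a]) {
            bool implied = false;
            for (uint32_t mid : edges[a]) {
                if (mid != b && reach[mid][b]) { implied = true; break; }
            }
            if (!implied)
                reduced[a].push_back(b);   // this edge must keep its fence
        }
    }
    return reduced;
}
```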
⚖️ What makes overlap good or bad#
Solving fences is the easy part — the compiler handles that. The harder question is whether overlapping two specific passes actually helps. Overlap pays off when the passes stress different hardware units, say a bandwidth-bound shadow pass alongside ALU-heavy SSAO; it backfires when both fight over the same caches and memory bus.
Try it yourself — move compute-eligible passes between queues and see how fence count and frame time change:
🧭 Should this pass go async?#
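The criteria above can be folded into a heuristic. This sketch is illustrative, not a rule from any particular engine — the thresholds (the fence cost, the 4× amortization factor, the bandwidth budget) are assumptions to tune against your own profiles:

```cpp
struct PassProfile {
    float gpuMicroseconds;      // measured or estimated cost
    float bandwidthFraction;    // 0 = pure ALU, 1 = pure memory traffic
};

bool shouldGoAsync(const PassProfile& candidate, const PassProfile& overlapped,
                   bool independentOfOverlap, float fenceCostUs = 15.0f) {
    if (!independentOfOverlap) return false;             // hard requirement
    if (candidate.gpuMicroseconds < 4.0f * fenceCostUs)  // too small to amortize
        return false;                                    // its two fences
    // Complementary bottlenecks overlap well; two bandwidth-bound passes
    // just fight over the same memory bus.
    return candidate.bandwidthFraction + overlapped.bandwidthFraction < 1.2f;
}
```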
✂️ Split Barriers#
Async compute hides latency by overlapping work across queues. Split barriers achieve the same effect on a single queue — by spreading one resource transition across multiple passes instead of stalling on it.
A regular barrier does a cache flush, state change, and cache invalidate in one blocking command — the GPU finishes the source pass, stalls while the transition completes, then starts the next pass. Every microsecond of that stall is wasted.
A split barrier breaks the transition into two halves, a begin and an end, and spreads them apart in the frame.
The passes between begin and end are the overlap gap — they execute while the cache flush happens in the background. The compiler places these automatically: begin immediately after the source pass, end immediately before the destination.
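On D3D12 this maps directly to the split-barrier flags. The `BEGIN_ONLY`/`END_ONLY` flags and the `ResourceBarrier` call are real API; the placement around them is the sketch, and in a frame graph both halves would be emitted by the compiler rather than written by hand:

```cpp
#include <d3d12.h>

void splitTransition(ID3D12GraphicsCommandList* cmd, ID3D12Resource* tex) {
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = tex;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

    // BEGIN half: issued right after the pass that last wrote the resource.
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
    cmd->ResourceBarrier(1, &barrier);

    // ... unrelated passes record here: the cache flush overlaps their work ...

    // END half: issued right before the pass that reads the resource.
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
    cmd->ResourceBarrier(1, &barrier);
}
```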
📏 How much gap is enough?#
Try it — drag the BEGIN marker left to widen the overlap gap and watch the stall disappear:
🎛️ Putting It All Together#
You’ve now seen every piece the compiler works with — topological sorting, pass culling, barrier insertion, async compute scheduling, memory aliasing, split barriers. In a simple 5-pass pipeline these feel manageable. In a production renderer? You’re looking at 15–25 passes, 30+ resource edges, and dozens of implicit dependencies — all inferred from read() and write() calls that no human can hold in their head at once.
Forget a single read() call and the graph silently reorders two passes that shouldn't overlap. Add an assert and you'll catch the symptom — but not the missing edge that caused it. Since the frame graph is a DAG, every dependency is explicitly encoded in the structure. That means you can build tools to visualize the entire pipeline — every pass, every resource edge, every implicit ordering decision — something that's impossible when barriers and ordering are scattered across hand-written render code.
The explorer below is a production-scale graph. Toggle each compiler feature on and off to see exactly what it contributes. Click any pass to inspect its dependencies — every edge was inferred from read() and write() calls, not hand-written.
