
Frame Graph — Theory

Rendering Architecture - This article is part of a series.
📖 Part I of III: Theory · Build It · Production Engines

🎯 Why You Want One

Without a graph: passes run in whatever order you wrote them, every GPU sync point is placed by hand, and each pass allocates its own memory — 900 MB gone. With one: passes are sorted by dependencies, barriers are inserted for you, and resources are shared safely — ~450 MB back.
You describe what each pass needs — the graph figures out the how.

Frostbite introduced it at GDC 2017. UE5 ships it as RDG. Every major renderer uses one — this series shows you why, walks you through building your own in C++, and maps every piece to what ships in production engines.

Learn Theory — what a frame graph is, why every engine uses one, and how each piece works
🔨 Build MVP — a working C++ frame graph, from scratch to prototype in ~300 lines
🗺️ Map to UE5 — every piece maps to RDG; read the source with confidence

🔥 The Problem

Month 1 — 3 passes, everything's fine
Depth prepass → GBuffer → lighting. Two barriers, hand-placed. Two textures, both allocated at init. Code is clean, readable, correct.
At this scale, manual management works. You know every resource by name.
Month 6 — 12 passes, cracks appear
Same renderer, now with SSAO, SSR, bloom, TAA, shadow cascades. Three things going wrong simultaneously:
Invisible dependencies — someone adds SSAO but doesn't realize GBuffer needs an updated barrier. Visual artifacts on fresh build.
Wasted memory — SSAO and bloom textures never overlap, but aliasing them means auditing every pass that might touch them. Nobody does it.
Silent reordering — two branches touch the render loop. Git merges cleanly, but the shadow pass ends up after lighting. Subtly wrong output ships unnoticed.
No single change broke it. The accumulation broke it.
Month 18 — 25 passes, nobody touches it
The renderer works, but:
900 MB VRAM. Profiling shows 400 MB is aliasable — but the lifetime analysis would take a week and break the next time anyone adds a pass.
47 barrier calls. Three are redundant, two are missing, one is in the wrong queue. Nobody knows which.
2 days to add a new pass. 30 minutes for the shader, the rest to figure out where to slot it and what barriers it needs.
The renderer isn't wrong. It's fragile. Every change is a risk.
Month 1: 3 passes · 2 barriers · ~40 MB VRAM · 0 aliasable · ✓ manageable
Month 6: 12 passes · 18 barriers · 380 MB VRAM · ~80 MB aliasable · ⚠ fragile
Month 18: 25 passes · 47 barriers · 900 MB VRAM · 400 MB aliasable · ✗ untouchable

The pattern is always the same: manual resource management works at small scale and fails at compound scale. Not because engineers are sloppy — because no human tracks 25 lifetimes and 47 transitions in their head every sprint. You need a system that sees the whole frame at once.


💡 The Core Idea

A frame graph is a directed acyclic graph (DAG) — each node is a render pass, each edge is a resource one pass hands to the next. Here’s what a typical deferred frame looks like:

Depth Prepass (depth) → GBuffer Pass (albedo · normals · depth) → SSAO (occlusion) → Lighting (HDR color) → Tonemap (→ present)
nodes = passes · edges = resource flow · arrows = write → read

You don’t execute this graph directly. Every frame goes through three steps — first you declare all the passes and what they read/write, then the system compiles an optimized plan (ordering, memory, barriers), and finally it executes the result:

Let’s look at each step.


📋 The Declare Step

Each frame starts on the CPU. You register passes, describe the resources they need, and declare who reads or writes what. No GPU work happens yet — you’re building a description of the frame.

📋 DECLARE — building the graph
• Add passes: addPass(setup, execute)
• Declare resources: create({1920, 1080, RGBA8})
• Wire dependencies: read(h) / write(h)
CPU only — the GPU is idle during this phase. Example declaration: Handle #3 — 1920×1080 · RGBA8 · render target, description only, no GPU memory yet.
Resources stay virtual at this stage — just a description and a handle. Memory comes later.
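A minimal sketch of what declaring a pass might look like. The API names here (FrameGraph, Builder, addPass, create, read, write) and the externally supplied handles (depthTarget, gbufferNormals, ssaoShader) are illustrative assumptions, not UE5's or Frostbite's API — Part II builds a concrete version of this interface.

```cpp
// Hypothetical declare-phase API — every name here is an assumption.
struct SSAOData { Handle depth, normals, occlusion; };

graph.addPass<SSAOData>("SSAO",
    // Setup lambda: runs at declare time, CPU only. It records resource
    // descriptions and read/write edges — only virtual handles exist here.
    [&](Builder& builder, SSAOData& data) {
        data.depth     = builder.read(depthTarget);            // dependency edge
        data.normals   = builder.read(gbufferNormals);         // dependency edge
        data.occlusion = builder.write(
            builder.create({1920, 1080, Format::R8_UNORM}));   // transient output
    },
    // Execute lambda: stored now, invoked later by the execute step, with
    // physical resources already bound and barriers already in place.
    [=](const SSAOData& data, CommandList& cmd) {
        cmd.dispatch(ssaoShader, data.depth, data.normals, data.occlusion);
    });
```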

📦 Transient vs. imported

When you declare a resource, the graph needs to know one thing: does it live inside this frame, or does it come from outside?

⚡ Transient
Lifetime: single frame
Declared as: description (size, format)
GPU memory: allocated and aliased at compile
Aliasable: Yes — non-overlapping lifetimes share physical memory
Examples: GBuffer MRTs, SSAO scratch, bloom scratch
📌 Imported
Lifetime: across frames
Declared as: existing GPU handle
GPU memory: already allocated externally
Aliasable: No — lifetime extends beyond the frame
Examples: backbuffer, TAA history, shadow atlas, blue noise LUT
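In code, the distinction is just two entry points — a short sketch using the same hypothetical API as above (swapchain and historyTexture are assumed to exist outside the graph):

```cpp
// Transient: described by size/format, allocated (and aliased) at compile time.
Handle ssaoBuffer = builder.create({1920, 1080, Format::R8_UNORM});

// Imported: wraps a GPU resource that already exists and outlives the frame.
// It participates in barrier tracking but is never aliased.
Handle backbuffer = graph.import("Backbuffer", swapchain.currentImage());
Handle taaHistory = graph.import("TAA History", historyTexture);
```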

⚙️ The Compile Step

The declared DAG goes in; an optimized execution plan comes out — all on the CPU, in microseconds.

📥 In declared passes + virtual resources + read/write edges
1. Sort passes into dependency order
2. Cull passes whose outputs are never read
3. Allocate — alias memory so non-overlapping lifetimes share physical blocks
4. Barrier — insert transitions at every resource state change
5. Bind — attach physical memory, creating or reusing from a pool
📤 Out ordered passes · aliased memory · barrier list · physical bindings

Sorting and culling

Sorting is a topological sort over the dependency edges, producing a linear order in which every resource is written before it is read.

Culling walks backward from the final outputs and removes any pass whose results are never read. Dead-code elimination for GPU work — entire passes vanish without a feature flag.
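Both steps fit in a few lines over the graph's adjacency lists. A sketch below — Kahn's algorithm for the sort, a backward flood for the cull; the pass/edge representation (index-based adjacency lists) is an assumption, not a specific engine's layout.

```cpp
#include <queue>
#include <vector>

// Topological sort (Kahn's algorithm). consumers[p] lists the passes that
// read p's outputs; incomingEdges[p] counts p's unmet dependencies.
std::vector<int> sortPasses(const std::vector<std::vector<int>>& consumers,
                            std::vector<int> incomingEdges) {
    std::queue<int> ready;
    for (int p = 0; p < (int)incomingEdges.size(); ++p)
        if (incomingEdges[p] == 0) ready.push(p);      // no dependencies left

    std::vector<int> order;
    while (!ready.empty()) {
        int pass = ready.front(); ready.pop();
        order.push_back(pass);
        for (int next : consumers[pass])               // passes reading our outputs
            if (--incomingEdges[next] == 0) ready.push(next);
    }
    return order;   // a cycle would leave passes out — the graph must be a DAG
}

// Culling: flood backward from a pass that produces a required output
// (e.g. the present/backbuffer pass); anything never reached is dead work.
void cullPasses(const std::vector<std::vector<int>>& producers,
                std::vector<bool>& alive, int presentPass) {
    std::vector<int> stack{presentPass};
    while (!stack.empty()) {
        int pass = stack.back(); stack.pop_back();
        if (alive[pass]) continue;
        alive[pass] = true;
        for (int dep : producers[pass]) stack.push_back(dep);
    }
}
```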

Allocation and aliasing

The sorted order tells the compiler exactly when each resource is first written and last read — its lifetime. Two resources that are never alive at the same time can share the same physical memory.

[Lifetime timeline across Passes 1–6: GBuffer and Bloom are never alive at the same time — no overlap → the same heap backs both resources.]
The graph allocates a large ID3D12Heap (or VkDeviceMemory) and places multiple resources at different offsets within it. This is the single biggest VRAM win the graph provides.
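The heart of the aliasing pass is an interval-overlap test over the sorted order. A simplified sketch, assuming lifetimes are expressed as pass indices and ignoring alignment, format, and heap-type constraints:

```cpp
#include <vector>

struct Lifetime { int firstWrite, lastRead; };   // pass indices in sorted order

bool overlaps(const Lifetime& a, const Lifetime& b) {
    return a.firstWrite <= b.lastRead && b.firstWrite <= a.lastRead;
}

// Greedy placement: reuse an existing slot (offset within a heap) whenever no
// resource already placed there is alive at the same time. Real compilers also
// check sizes, alignment, and bucket by heap type — omitted here.
std::vector<int> assignSlots(const std::vector<Lifetime>& lifetimes) {
    std::vector<std::vector<int>> slots;          // slot -> resources placed in it
    std::vector<int> slotOf(lifetimes.size());
    for (int r = 0; r < (int)lifetimes.size(); ++r) {
        int found = -1;
        for (int s = 0; s < (int)slots.size() && found < 0; ++s) {
            bool clash = false;
            for (int other : slots[s])
                clash |= overlaps(lifetimes[r], lifetimes[other]);
            if (!clash) found = s;
        }
        if (found < 0) { slots.push_back({}); found = (int)slots.size() - 1; }
        slots[found].push_back(r);
        slotOf[r] = found;
    }
    return slotOf;   // resources sharing a slot share physical memory
}
```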
⚠ Pitfalls
• Garbage — aliased memory has stale contents; the first use must be a full clear or overwrite
• Transient only — imported resources live across frames; only single-frame transients qualify
• Sync — the old resource must finish all GPU access before the new one touches the same memory
Production optimizations
• 🪣 Bucketing — round sizes to powers of two (4, 8, 16 MB…); fewer distinct sizes means heaps are reusable across resources
• ♻️ Pooling — keep heaps across frames; the next frame's compile() pulls from the pool, so allocation cost drops to near zero
Part III covers how UE5 and Frostbite implement these strategies.

Barriers

The compiler knows each resource’s state at every point — render target, shader read, copy source — and inserts a barrier at every transition. Hand-written barriers are one of the most common sources of GPU bugs; the graph makes them automatic and correct by construction.
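Conceptually this is a per-resource state machine walked in execution order. A self-contained sketch — the Handle/Pass/ResourceState types are illustrative, not a specific API:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using Handle = uint32_t;                        // virtual resource id
enum class ResourceState { Undefined, RenderTarget, ShaderRead, CopySrc };

struct PassUsage { Handle resource; ResourceState state; };
struct Pass      { std::vector<PassUsage> usages; };
struct Barrier   { int beforePass; Handle resource; ResourceState from, to; };

// Walk passes in compiled order; whenever a pass uses a resource in a state
// different from the one it was left in, record a transition before that pass.
std::vector<Barrier> buildBarriers(const std::vector<Pass>& order) {
    std::unordered_map<Handle, ResourceState> current;     // last known state
    std::vector<Barrier> barriers;
    for (int i = 0; i < (int)order.size(); ++i) {
        for (const PassUsage& use : order[i].usages) {
            ResourceState prev = current.count(use.resource)
                                     ? current[use.resource]
                                     : ResourceState::Undefined;
            if (prev != use.state)
                barriers.push_back({i, use.resource, prev, use.state});
            current[use.resource] = use.state;
        }
    }
    return barriers;   // replayed verbatim during execute — no hand-written syncs
}
```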


▶️ The Execute Step

The plan is ready — now the GPU gets involved. Every decision has already been made during compile: pass order, memory layout, barriers, physical resource bindings. Execute just walks the plan.

▶️ EXECUTE — recording GPU commands
• Run passes: for each pass in compiled order, insert its barriers → call its execute()
• Cleanup: release transients (or pool them), reset the frame allocator
The only phase that touches the GPU API — resources are already bound.
Each execute lambda sees a fully resolved environment — barriers already placed, memory already allocated, resources ready to bind. The lambda just records draw calls, dispatches, and copies. All the intelligence lives in the compile step.
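The whole execute step can stay this small. A sketch, again with hypothetical names (CompiledPass, plan, pool, frameAllocator are assumptions carried over from the earlier sketches):

```cpp
// Execute: walk the compiled plan. Every decision was made at compile time;
// this loop only records API commands.
void FrameGraph::execute(CommandList& cmd) {
    for (const CompiledPass& pass : plan.orderedPasses) {
        cmd.issueBarriers(plan.barriersBefore(pass));   // produced by the compile step
        pass.execute(cmd);                              // user lambda: draws, dispatches, copies
    }
    // Cleanup: transients either return to the heap pool for next frame's
    // compile() to reuse, or are released outright.
    pool.recycle(plan.transientHeaps);
    frameAllocator.reset();
}
```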

🔄 Rebuild Strategies

How often should the graph recompile? Three approaches, each a valid tradeoff:

🔄 Dynamic — rebuild every frame. Cost: microseconds. Flexibility: full — passes appear/disappear freely. Used by Frostbite.
⚡ Hybrid — cache the compiled result, invalidate on change. Cost: near zero on a cache hit. Flexibility: full, plus bookkeeping. Used by UE5.
🔒 Static — compile once at init, replay forever. Cost: zero. Flexibility: none — fixed pipeline. Rare in practice.

Most engines use dynamic or hybrid. The compile is so cheap that caching buys little — but some engines do it anyway to skip redundant barrier recalculation.
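One common way to realize the hybrid strategy is to hash the declared frame and reuse the previous plan when nothing changed. A minimal sketch, assuming the hypothetical FrameGraph keeps its last plan and a declarationHash() helper that combines pass names, resource descriptions, and read/write edges in declaration order:

```cpp
// Hybrid rebuild: recompile only when the declared frame actually changed.
// declarationHash(), cachedPlan, and lastCompiledHash are assumed members.
void FrameGraph::compileIfNeeded() {
    uint64_t hash = declarationHash();
    if (hash == lastCompiledHash && cachedPlan.valid) {
        plan = cachedPlan;              // cache hit: skip sort/cull/alias/barrier work
        return;
    }
    plan = compile();                   // full compile as described above
    cachedPlan = plan;
    lastCompiledHash = hash;
}
```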


💰 The Payoff

❌ Without a graph vs. ✅ with one:
• Memory aliasing — without: opt-in, fragile, rarely done → with: automatic; the compiler sees all lifetimes, 30–50% VRAM saved
• Lifetimes — without: manual create/destroy, leaked or over-retained → with: scoped to first..last use, zero waste
• Barriers — without: manual, per-pass → with: automatic from declared read/write
• Pass reordering — without: breaks silently → with: safe, the compiler respects dependencies
• Pass culling — without: manual ifdef / flag checks → with: automatic, unused outputs = dead pass
• Async compute — without: manual queue sync → with: the compiler schedules across queues
🏭 Not theoretical. Frostbite reported 50% VRAM reduction from aliasing at GDC 2017. UE5's RDG ships the same optimization today — every FRDGTexture marked as transient goes through the same aliasing pipeline we build in Part II.

🔬 Advanced Features

The core graph handles scheduling, barriers, and aliasing — but the same DAG enables the compiler to go further. It can merge adjacent render passes to eliminate redundant state changes, schedule independent work across GPU queues, and split barrier transitions to hide cache-flush latency. Part II builds the core; Part III shows how production engines deploy all of these.

🔗 Pass Merging

Every render pass boundary has a cost — the GPU resolves attachments, flushes caches, stores intermediate results to memory, and sets up state for the next pass. When two adjacent passes share the same render targets, that boundary is pure overhead. Pass merging fuses compatible passes into a single API render pass, eliminating the round-trip entirely.
Without merging: Pass A (GBuffer) renders and stores to VRAM ✗; Pass B (Lighting) loads from VRAM ✗ and renders — 2 render passes, 1 unnecessary round-trip.
With merging: Pass A+B fused — render A, B reads in-place ✓, render B, store once to VRAM — 1 render pass, no intermediate memory traffic.
When can two passes merge? Three conditions, all required (see the sketch after this list):
• Same render target dimensions
• The second pass reads the first's output at the current pixel only (no arbitrary UV sampling)
• No external dependencies forcing a render pass break
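A compiler might encode those three conditions as a simple predicate over adjacent passes. A sketch — the field names are assumptions, not any engine's actual metadata:

```cpp
// Illustrative per-pass metadata for the merge test.
struct RasterPass {
    int width = 0, height = 0;                    // render target dimensions
    bool samplesProducerAtArbitraryUV = false;    // true → needs a full pass break
    bool hasExternalBreak = false;                // e.g. a compute pass or queue switch in between
};

bool canMerge(const RasterPass& producer, const RasterPass& consumer) {
    return producer.width == consumer.width &&
           producer.height == consumer.height &&           // 1. same dimensions
           !consumer.samplesProducerAtArbitraryUV &&       // 2. current-pixel reads only
           !consumer.hasExternalBreak;                     // 3. nothing forces a boundary
}
```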

Fewer render pass boundaries means fewer state changes, less barrier overhead, and the driver gets a larger scope to schedule work internally. D3D12 Render Pass Tier 2 hardware can eliminate intermediate stores for merged passes entirely — the GPU keeps data on-chip between subpasses instead of round-tripping through VRAM. Console GPUs benefit similarly, where the driver can batch state setup across fused passes.

⚡ Async Compute

Pass merging and barriers optimize work on a single GPU queue. But modern GPUs expose at least two: a graphics queue and a compute queue. If two passes have no dependency path between them in the DAG, the compiler can schedule them on different queues simultaneously.

🔍 Finding parallelism

The compiler needs to answer one question for every pair of passes: can these run at the same time? Two passes can overlap only if neither depends on the other — directly or indirectly. A pass that writes the GBuffer can’t overlap with lighting (which reads it), but it can overlap with SSAO if they share no resources.

The algorithm is called reachability analysis — for each pass, the compiler figures out every other pass it can eventually reach by following edges forward through the DAG. If pass A can reach pass B (or B can reach A), they’re dependent. If neither can reach the other, they’re independent — safe to run on separate queues.
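Reachability can be computed once per compile with a forward flood from every pass, processed in reverse topological order so successors are already finished. A sketch over the same index-based adjacency lists assumed earlier:

```cpp
#include <vector>

// reachable[a][b] == true if pass b is reachable from pass a by following
// dependency edges forward. Two passes may overlap across queues only when
// neither can reach the other.
std::vector<std::vector<bool>> computeReachability(
        const std::vector<std::vector<int>>& consumers,   // pass -> passes reading its outputs
        const std::vector<int>& topoOrder) {
    const int n = (int)consumers.size();
    std::vector<std::vector<bool>> reachable(n, std::vector<bool>(n, false));
    // Reverse topological order: a pass's consumers are processed before it.
    for (int i = n - 1; i >= 0; --i) {
        int pass = topoOrder[i];
        for (int next : consumers[pass]) {
            reachable[pass][next] = true;
            for (int k = 0; k < n; ++k)
                if (reachable[next][k]) reachable[pass][k] = true;
        }
    }
    return reachable;
}

bool independent(const std::vector<std::vector<bool>>& r, int a, int b) {
    return !r[a][b] && !r[b][a];        // safe to schedule on different queues
}
```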

🚧 Minimizing fences

Cross-queue work needs GPU fences — one queue signals, the other waits. Each fence costs ~5–15 µs of dead GPU time. Move SSAO, volumetrics, and particle sim to compute and you create six fences — up to 90 µs of idle that can erase the overlap gain. The compiler applies transitive reduction to collapse those down:

Naive — 4 fences
Graphics: [A] ──fence──→ [C]
             └──fence──→ [D]

Compute:  [B] ──fence──→ [C]
             └──fence──→ [D]
Every cross-queue edge gets its own fence
Reduced — 1 fence
Graphics: [A] ─────────→ [C] → [D]
                             
Compute:  [B] ──fence──┘

B's fence covers both C and D
(D is after C on graphics queue)
Redundant fences removed transitively
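With reachability already computed, dropping a redundant fence is one check per cross-queue edge. A sketch, assuming the compiler tracks which producers the consumer's queue already waits on before the consumer runs:

```cpp
#include <vector>

// A cross-queue edge producer -> consumer needs its own fence only if no wait
// already recorded on the consumer's queue (before the consumer runs) covers it.
// 'reachable' is the matrix from the reachability sketch above.
bool fenceIsRedundant(int producer,
                      const std::vector<int>& waitsAlreadyBeforeConsumer,
                      const std::vector<std::vector<bool>>& reachable) {
    for (int signaled : waitsAlreadyBeforeConsumer) {
        // If we already wait on the producer itself, or on a pass that depends
        // on the producer (so its completion implies the producer finished),
        // the existing fence also protects this edge.
        if (signaled == producer || reachable[producer][signaled])
            return true;
    }
    return false;   // otherwise a new fence must be emitted for this edge
}
```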

⚖️ What makes overlap good or bad

Solving fences is the easy part — the compiler handles that. The harder question is whether overlapping two specific passes actually helps:

✓ Complementary
Graphics is ROP/rasterizer-bound (shadow rasterization, geometry-dense passes) while compute runs ALU-heavy shaders (SSAO, volumetrics). Different hardware units stay busy — real parallelism, measurable frame time reduction.
✗ Competing
Both passes are bandwidth-bound or both ALU-heavy — they thrash each other's L2 cache and fight for CU time. The frame gets slower than running them sequentially. Common trap: overlapping two fullscreen post-effects.
NVIDIA uses dedicated async engines. AMD exposes more independent CUs for overlap. But on both: always profile per-GPU — the overlap that wins on one architecture can regress on another.

Try it yourself — move compute-eligible passes between queues and see how fence count and frame time change:

🔬 Interactive: Async Compute Scheduling

Move compute-eligible passes (blue) to the Compute queue to overlap work. Yellow dashed lines = fences (cross-queue sync points).

🧭 Should this pass go async?

Compute-only? no → needs rasterization
yes ↓
Independent of graphics? no → shared resource
yes ↓
Complementary overlap? no → profile first
yes ↓
Enough work between fences? no → sync eats the gain
yes ↓
ASYNC COMPUTE ✓
Good candidates: SSAO alongside ROP-bound geometry, volumetrics during shadow rasterization, particle sim during UI.

✂️ Split Barriers

Async compute hides latency by overlapping work across queues. Split barriers achieve the same effect on a single queue — by spreading one resource transition across multiple passes instead of stalling on it.

A regular barrier does a cache flush, state change, and cache invalidate in one blocking command — the GPU finishes the source pass, stalls while the transition completes, then starts the next pass. Every microsecond of that stall is wasted.

A split barrier breaks the transition into two halves and spreads them apart:

Source pass (writes texture) → BEGIN (flush caches) → Pass C (unrelated work) → Pass D (unrelated work) → END (invalidate) → destination pass (reads texture). The cache flush runs in the background while C and D execute.

The passes between begin and end are the overlap gap — they execute while the cache flush happens in the background. The compiler places these automatically: begin immediately after the source pass, end immediately before the destination.
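On D3D12 the two halves map directly to the begin/end barrier flags. A sketch of what the compiler emits — texture, cmdList, and the recordPassC/recordPassD helpers are placeholders; only the D3D12 structures and flags are real:

```cpp
#include <d3d12.h>

// A split barrier is the same Transition barrier issued twice: once with
// BEGIN_ONLY right after the producer, once with END_ONLY right before the
// consumer. The GPU may overlap the transition with the work in between.
D3D12_RESOURCE_BARRIER begin = {};
begin.Type  = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
begin.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
begin.Transition.pResource   = texture;                 // the resource being transitioned
begin.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
begin.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
begin.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;

D3D12_RESOURCE_BARRIER end = begin;
end.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;

cmdList->ResourceBarrier(1, &begin);   // right after the source pass
recordPassC(cmdList);                  // unrelated work — the overlap gap
recordPassD(cmdList);
cmdList->ResourceBarrier(1, &end);     // right before the destination pass
```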

📏 How much gap is enough?

• 0 passes — no gap; degenerates into a regular barrier with extra API cost
• 1 pass — marginal; might not cover the full flush latency
• 2+ passes — cache flush fully hidden; measurable frame time reduction
• Cross-queue — can't split across queues; use an async fence instead

Try it — drag the BEGIN marker left to widen the overlap gap and watch the stall disappear:

🔬 Interactive: Split Barriers

A barrier transitions a resource between states (e.g. render target → shader read). A regular barrier stalls the GPU at the transition point. A split barrier spreads it over a gap — the GPU overlaps other work in between. Drag the BEGIN marker left to widen the gap. Click any pass to reassign producer / consumer.

That's all the theory. Part II implements the core — barriers, culling, aliasing — in ~300 lines of C++. Part III shows how production engines deploy all of these at scale.

🎛️ Putting It All Together

You’ve now seen every piece the compiler works with — topological sorting, pass culling, barrier insertion, async compute scheduling, memory aliasing, split barriers. In a simple 5-pass pipeline these feel manageable. In a production renderer? You’re looking at 15–25 passes, 30+ resource edges, and dozens of implicit dependencies — all inferred from read() and write() calls that no human can hold in their head at once.

This is the trade-off at the heart of every render graph. Dependencies become implicit — the graph infers ordering from data flow, which means you never declare "pass A must run before pass B." That's powerful: the compiler can reorder, cull, and parallelize freely. But it also means dependencies are hidden. Miss a read() call and the graph silently reorders two passes that shouldn't overlap. Add an assert and you'll catch the symptom — but not the missing edge that caused it.

Since the frame graph is a DAG, every dependency is explicitly encoded in the structure. That means you can build tools to visualize the entire pipeline — every pass, every resource edge, every implicit ordering decision — something that’s impossible when barriers and ordering are scattered across hand-written render code.

The explorer below is a production-scale graph. Toggle each compiler feature on and off to see exactly what it contributes. Click any pass to inspect its dependencies — every edge was inferred from read() and write() calls, not hand-written.

🎛️ Interactive: Full Pipeline Explorer

The minimap shows the full render graph. Click any pass to focus — the detail view shows its neighbors, resources, and how the compiler transforms it.

