
Frame Graph — Theory

Rendering Architecture - This article is part of a series.
📖 Part I of III: Theory · Build It · Production Engines

🎯 Why You Want One

Without a graph: passes run in whatever order you wrote them, every GPU sync point is placed by hand, and each pass allocates its own memory — 900 MB gone. With one: passes are sorted by dependencies, barriers are inserted for you, and resources are shared safely — ~450 MB back.
You describe what each pass needs — the graph figures out the how.

Frostbite introduced it at GDC 2017. UE5 ships it as RDG. Every major renderer uses one — this series shows you why, walks you through building your own in C++, and maps every piece to what ships in production engines.

Learn Theory — what a frame graph is, why every engine uses one, and how each piece works
🔨 Build MVP — a working C++ frame graph, from scratch to prototype in ~300 lines
🗺️ Map to UE5 — every piece maps to RDG; read the source with confidence

🔥 The Problem

Month 1 — 3 passes, everything's fine
Depth prepass → GBuffer → lighting. Two barriers, hand-placed. Two textures, both allocated at init. Code is clean, readable, correct.
At this scale, manual management works. You know every resource by name.
Month 6 — 12 passes, cracks appear
Same renderer, now with SSAO, SSR, bloom, TAA, shadow cascades. Three things going wrong simultaneously:
Invisible dependencies — someone adds SSAO but doesn't realize GBuffer needs an updated barrier. Visual artifacts on fresh build.
Wasted memory — SSAO and bloom textures never overlap, but aliasing them means auditing every pass that might touch them. Nobody does it.
Silent reordering — two branches touch the render loop. Git merges cleanly, but the shadow pass ends up after lighting. Subtly wrong output ships unnoticed.
No single change broke it. The accumulation broke it.
Month 18 — 25 passes, nobody touches it
The renderer works, but:
900 MB VRAM. Profiling shows 400 MB is aliasable — but the lifetime analysis would take a week and break the next time anyone adds a pass.
47 barrier calls. Three are redundant, two are missing, one is in the wrong queue. Nobody knows which.
2 days to add a new pass. 30 minutes for the shader, the rest to figure out where to slot it and what barriers it needs.
The renderer isn't wrong. It's fragile. Every change is a risk.
Month 1: 3 passes · 2 barriers · ~40 MB VRAM · 0 aliasable · ✓ manageable
Month 6: 12 passes · 18 barriers · 380 MB VRAM · ~80 MB aliasable · ⚠ fragile
Month 18: 25 passes · 47 barriers · 900 MB VRAM · 400 MB aliasable · ✗ untouchable

The pattern is always the same: manual resource management works at small scale and fails at compound scale. Not because engineers are sloppy — because no human tracks 25 lifetimes and 47 transitions in their head every sprint. You need a system that sees the whole frame at once.


💡 The Core Idea

A frame graph is a directed acyclic graph (DAG) — each node is a render pass, each edge is a resource one pass hands to the next. Here’s what a typical deferred frame looks like:

Depth Prepass (depth) → GBuffer Pass (albedo · normals · depth) → SSAO (occlusion) → Lighting (HDR color) → Tonemap (→ present)
nodes = passes · edges = resource flow · arrows = write → read

You don’t execute this graph directly. Every frame goes through three steps — first you declare all the passes and what they read/write, then the system compiles an optimized plan (ordering, memory, barriers), and finally it executes the result:

Let’s look at each step.


📋 The Declare Step

Each frame starts on the CPU. You register passes, describe the resources they need, and declare who reads or writes what. No GPU work happens yet — you’re building a description of the frame.

📋 DECLARE — building the graph
• Add passes: addPass(setup, execute)
• Declare resources: create({1920, 1080, RGBA8})
• Wire dependencies: read(h) / write(h)
CPU only — the GPU is idle during this phase. Example declaration: Handle #3 — 1920×1080 · RGBA8 · render target, description only, no GPU memory yet.
Resources stay virtual at this stage — just a description and a handle. Memory comes later.
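A minimal sketch of what declaring a pass might look like. The API names here (FrameGraph, Builder, addPass, create, read, write) and the externally supplied handles (depthTarget, gbufferNormals, ssaoShader) are illustrative assumptions, not UE5's or Frostbite's API — Part II builds a concrete version of this interface.

```cpp
// Hypothetical declare-phase API — every name here is an assumption.
struct SSAOData { Handle depth, normals, occlusion; };

graph.addPass<SSAOData>("SSAO",
    // Setup lambda: runs at declare time, CPU only. It records resource
    // descriptions and read/write edges — only virtual handles exist here.
    [&](Builder& builder, SSAOData& data) {
        data.depth     = builder.read(depthTarget);            // dependency edge
        data.normals   = builder.read(gbufferNormals);         // dependency edge
        data.occlusion = builder.write(
            builder.create({1920, 1080, Format::R8_UNORM}));   // transient output
    },
    // Execute lambda: stored now, invoked later by the execute step, with
    // physical resources already bound and barriers already in place.
    [=](const SSAOData& data, CommandList& cmd) {
        cmd.dispatch(ssaoShader, data.depth, data.normals, data.occlusion);
    });
```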

📦 Transient vs. imported

When you declare a resource, the graph needs to know one thing: does it live inside this frame, or does it come from outside?

⚡ Transient
Lifetime: single frame
Declared as: description (size, format)
GPU memory: allocated and aliased at compile
Aliasable: Yes — non-overlapping lifetimes share physical memory
Examples: GBuffer MRTs, SSAO scratch, bloom scratch
📌 Imported
Lifetime: across frames
Declared as: existing GPU handle
GPU memory: already allocated externally
Aliasable: No — lifetime extends beyond the frame
Examples: backbuffer, TAA history, shadow atlas, blue noise LUT
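In code, the distinction is just two entry points — a short sketch using the same hypothetical API as above (swapchain and historyTexture are assumed to exist outside the graph):

```cpp
// Transient: described by size/format, allocated (and aliased) at compile time.
Handle ssaoBuffer = builder.create({1920, 1080, Format::R8_UNORM});

// Imported: wraps a GPU resource that already exists and outlives the frame.
// It participates in barrier tracking but is never aliased.
Handle backbuffer = graph.import("Backbuffer", swapchain.currentImage());
Handle taaHistory = graph.import("TAA History", historyTexture);
```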

⚙️ The Compile Step

The declared DAG goes in; an optimized execution plan comes out — all on the CPU, in microseconds.

📥 In declared passes + virtual resources + read/write edges
1. Sort passes into dependency order
2. Cull passes whose outputs are never read
3. Allocate — alias memory so non-overlapping lifetimes share physical blocks
4. Barrier — insert transitions at every resource state change
5. Bind — attach physical memory, creating or reusing from a pool
📤 Out ordered passes · aliased memory · barrier list · physical bindings

Sorting and culling

Sorting is a topological sort over the dependency edges, producing a linear order in which every resource is written before it is read.

Culling walks backward from the final outputs and removes any pass whose results are never read. Dead-code elimination for GPU work — entire passes vanish without a feature flag.
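Both steps fit in a few lines over the graph's adjacency lists. A sketch below — Kahn's algorithm for the sort, a backward flood for the cull; the pass/edge representation (index-based adjacency lists) is an assumption, not a specific engine's layout.

```cpp
#include <queue>
#include <vector>

// Topological sort (Kahn's algorithm). consumers[p] lists the passes that
// read p's outputs; incomingEdges[p] counts p's unmet dependencies.
std::vector<int> sortPasses(const std::vector<std::vector<int>>& consumers,
                            std::vector<int> incomingEdges) {
    std::queue<int> ready;
    for (int p = 0; p < (int)incomingEdges.size(); ++p)
        if (incomingEdges[p] == 0) ready.push(p);      // no dependencies left

    std::vector<int> order;
    while (!ready.empty()) {
        int pass = ready.front(); ready.pop();
        order.push_back(pass);
        for (int next : consumers[pass])               // passes reading our outputs
            if (--incomingEdges[next] == 0) ready.push(next);
    }
    return order;   // a cycle would leave passes out — the graph must be a DAG
}

// Culling: flood backward from a pass that produces a required output
// (e.g. the present/backbuffer pass); anything never reached is dead work.
void cullPasses(const std::vector<std::vector<int>>& producers,
                std::vector<bool>& alive, int presentPass) {
    std::vector<int> stack{presentPass};
    while (!stack.empty()) {
        int pass = stack.back(); stack.pop_back();
        if (alive[pass]) continue;
        alive[pass] = true;
        for (int dep : producers[pass]) stack.push_back(dep);
    }
}
```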

Allocation and aliasing

The sorted order tells the compiler exactly when each resource is first written and last read — its lifetime. Two resources that are never alive at the same time can share the same physical memory.

[Lifetime timeline across Passes 1–6: GBuffer and Bloom are never alive at the same time — no overlap → the same heap backs both resources.]
The graph allocates a large ID3D12Heap (or VkDeviceMemory) and places multiple resources at different offsets within it. This is the single biggest VRAM win the graph provides.
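The heart of the aliasing pass is an interval-overlap test over the sorted order. A simplified sketch, assuming lifetimes are expressed as pass indices and ignoring alignment, format, and heap-type constraints:

```cpp
#include <vector>

struct Lifetime { int firstWrite, lastRead; };   // pass indices in sorted order

bool overlaps(const Lifetime& a, const Lifetime& b) {
    return a.firstWrite <= b.lastRead && b.firstWrite <= a.lastRead;
}

// Greedy placement: reuse an existing slot (offset within a heap) whenever no
// resource already placed there is alive at the same time. Real compilers also
// check sizes, alignment, and bucket by heap type — omitted here.
std::vector<int> assignSlots(const std::vector<Lifetime>& lifetimes) {
    std::vector<std::vector<int>> slots;          // slot -> resources placed in it
    std::vector<int> slotOf(lifetimes.size());
    for (int r = 0; r < (int)lifetimes.size(); ++r) {
        int found = -1;
        for (int s = 0; s < (int)slots.size() && found < 0; ++s) {
            bool clash = false;
            for (int other : slots[s])
                clash |= overlaps(lifetimes[r], lifetimes[other]);
            if (!clash) found = s;
        }
        if (found < 0) { slots.push_back({}); found = (int)slots.size() - 1; }
        slots[found].push_back(r);
        slotOf[r] = found;
    }
    return slotOf;   // resources sharing a slot share physical memory
}
```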
⚠ Pitfalls
• Garbage — aliased memory has stale contents; the first use must be a full clear or overwrite
• Transient only — imported resources live across frames; only single-frame transients qualify
• Sync — the old resource must finish all GPU access before the new one touches the same memory
Production optimizations
• 🪣 Bucketing — round sizes to powers of two (4, 8, 16 MB…); fewer distinct sizes means heaps are reusable across resources
• ♻️ Pooling — keep heaps across frames; the next frame's compile() pulls from the pool, so allocation cost drops to near zero
Part III covers how UE5 and Frostbite implement these strategies.

Barriers

The compiler knows each resource’s state at every point — render target, shader read, copy source — and inserts a barrier at every transition. Hand-written barriers are one of the most common sources of GPU bugs; the graph makes them automatic and correct by construction.
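Conceptually this is a per-resource state machine walked in execution order. A self-contained sketch — the Handle/Pass/ResourceState types are illustrative, not a specific API:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using Handle = uint32_t;                        // virtual resource id
enum class ResourceState { Undefined, RenderTarget, ShaderRead, CopySrc };

struct PassUsage { Handle resource; ResourceState state; };
struct Pass      { std::vector<PassUsage> usages; };
struct Barrier   { int beforePass; Handle resource; ResourceState from, to; };

// Walk passes in compiled order; whenever a pass uses a resource in a state
// different from the one it was left in, record a transition before that pass.
std::vector<Barrier> buildBarriers(const std::vector<Pass>& order) {
    std::unordered_map<Handle, ResourceState> current;     // last known state
    std::vector<Barrier> barriers;
    for (int i = 0; i < (int)order.size(); ++i) {
        for (const PassUsage& use : order[i].usages) {
            ResourceState prev = current.count(use.resource)
                                     ? current[use.resource]
                                     : ResourceState::Undefined;
            if (prev != use.state)
                barriers.push_back({i, use.resource, prev, use.state});
            current[use.resource] = use.state;
        }
    }
    return barriers;   // replayed verbatim during execute — no hand-written syncs
}
```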


▶️ The Execute Step

The plan is ready — now the GPU gets involved. Every decision has already been made during compile: pass order, memory layout, barriers, physical resource bindings. Execute just walks the plan.

▶️ EXECUTE — recording GPU commands
• Run passes: for each pass in compiled order, insert its barriers → call its execute()
• Cleanup: release transients (or pool them), reset the frame allocator
The only phase that touches the GPU API — resources are already bound.
Each execute lambda sees a fully resolved environment — barriers already placed, memory already allocated, resources ready to bind. The lambda just records draw calls, dispatches, and copies. All the intelligence lives in the compile step.
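The whole execute step can stay this small. A sketch, again with hypothetical names (CompiledPass, plan, pool, frameAllocator are assumptions carried over from the earlier sketches):

```cpp
// Execute: walk the compiled plan. Every decision was made at compile time;
// this loop only records API commands.
void FrameGraph::execute(CommandList& cmd) {
    for (const CompiledPass& pass : plan.orderedPasses) {
        cmd.issueBarriers(plan.barriersBefore(pass));   // produced by the compile step
        pass.execute(cmd);                              // user lambda: draws, dispatches, copies
    }
    // Cleanup: transients either return to the heap pool for next frame's
    // compile() to reuse, or are released outright.
    pool.recycle(plan.transientHeaps);
    frameAllocator.reset();
}
```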

🔄 Rebuild Strategies

How often should the graph recompile? Three approaches, each a valid tradeoff:

🔄 Dynamic — rebuild every frame. Cost: microseconds. Flexibility: full — passes appear/disappear freely. Used by Frostbite.
⚡ Hybrid — cache the compiled result, invalidate on change. Cost: near zero on a cache hit. Flexibility: full, plus bookkeeping. Used by UE5.
🔒 Static — compile once at init, replay forever. Cost: zero. Flexibility: none — fixed pipeline. Rare in practice.

Most engines use dynamic or hybrid. The compile is so cheap that caching buys little — but some engines do it anyway to skip redundant barrier recalculation.
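One common way to realize the hybrid strategy is to hash the declared frame and reuse the previous plan when nothing changed. A minimal sketch, assuming the hypothetical FrameGraph keeps its last plan and a declarationHash() helper that combines pass names, resource descriptions, and read/write edges in declaration order:

```cpp
// Hybrid rebuild: recompile only when the declared frame actually changed.
// declarationHash(), cachedPlan, and lastCompiledHash are assumed members.
void FrameGraph::compileIfNeeded() {
    uint64_t hash = declarationHash();
    if (hash == lastCompiledHash && cachedPlan.valid) {
        plan = cachedPlan;              // cache hit: skip sort/cull/alias/barrier work
        return;
    }
    plan = compile();                   // full compile as described above
    cachedPlan = plan;
    lastCompiledHash = hash;
}
```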


💰 The Payoff

❌ Without a graph vs. ✅ with one:
• Memory aliasing — without: opt-in, fragile, rarely done → with: automatic; the compiler sees all lifetimes, 30–50% VRAM saved
• Lifetimes — without: manual create/destroy, leaked or over-retained → with: scoped to first..last use, zero waste
• Barriers — without: manual, per-pass → with: automatic from declared read/write
• Pass reordering — without: breaks silently → with: safe, the compiler respects dependencies
• Pass culling — without: manual ifdef / flag checks → with: automatic, unused outputs = dead pass
• Async compute — without: manual queue sync → with: the compiler schedules across queues
🏭 Not theoretical. Frostbite reported 50% VRAM reduction from aliasing at GDC 2017. UE5's RDG ships the same optimization today — every FRDGTexture marked as transient goes through the same aliasing pipeline we build in Part II.

🔬 Advanced Features

The core graph handles scheduling, barriers, and aliasing — but the same DAG enables the compiler to go further. It can merge adjacent render passes to eliminate redundant state changes, schedule independent work across GPU queues, and split barrier transitions to hide cache-flush latency. Part II builds the core; Part III shows how production engines deploy all of these.

🔗 Pass Merging

Every render pass boundary has a cost — the GPU resolves attachments, flushes caches, stores intermediate results to memory, and sets up state for the next pass. When two adjacent passes share the same render targets, that boundary is pure overhead. Pass merging fuses compatible passes into a single API render pass, eliminating the round-trip entirely.
Without merging: Pass A (GBuffer) renders and stores to VRAM ✗; Pass B (Lighting) loads from VRAM ✗ and renders — 2 render passes, 1 unnecessary round-trip.
With merging: Pass A+B fused — render A, B reads in-place ✓, render B, store once to VRAM — 1 render pass, no intermediate memory traffic.
When can two passes merge? Three conditions, all required (see the sketch after this list):
• Same render target dimensions
• The second pass reads the first's output at the current pixel only (no arbitrary UV sampling)
• No external dependencies forcing a render pass break
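A compiler might encode those three conditions as a simple predicate over adjacent passes. A sketch — the field names are assumptions, not any engine's actual metadata:

```cpp
// Illustrative per-pass metadata for the merge test.
struct RasterPass {
    int width = 0, height = 0;                    // render target dimensions
    bool samplesProducerAtArbitraryUV = false;    // true → needs a full pass break
    bool hasExternalBreak = false;                // e.g. a compute pass or queue switch in between
};

bool canMerge(const RasterPass& producer, const RasterPass& consumer) {
    return producer.width == consumer.width &&
           producer.height == consumer.height &&           // 1. same dimensions
           !consumer.samplesProducerAtArbitraryUV &&       // 2. current-pixel reads only
           !consumer.hasExternalBreak;                     // 3. nothing forces a boundary
}
```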

Fewer render pass boundaries means fewer state changes, less barrier overhead, and the driver gets a larger scope to schedule work internally. D3D12 Render Pass Tier 2 hardware can eliminate intermediate stores for merged passes entirely — the GPU keeps data on-chip between subpasses instead of round-tripping through VRAM. Console GPUs benefit similarly, where the driver can batch state setup across fused passes.

⚡ Async Compute

Pass merging and barriers optimize work on a single GPU queue. But modern GPUs expose at least two: a graphics queue and a compute queue. If two passes have no dependency path between them in the DAG, the compiler can schedule them on different queues simultaneously.

🔍 Finding parallelism

The compiler needs to answer one question for every pair of passes: can these run at the same time? Two passes can overlap only if neither depends on the other — directly or indirectly. A pass that writes the GBuffer can’t overlap with lighting (which reads it), but it can overlap with SSAO if they share no resources.

The algorithm is called reachability analysis — for each pass, the compiler figures out every other pass it can eventually reach by following edges forward through the DAG. If pass A can reach pass B (or B can reach A), they’re dependent. If neither can reach the other, they’re independent — safe to run on separate queues.
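Reachability can be computed once per compile with a forward flood from every pass, processed in reverse topological order so successors are already finished. A sketch over the same index-based adjacency lists assumed earlier:

```cpp
#include <vector>

// reachable[a][b] == true if pass b is reachable from pass a by following
// dependency edges forward. Two passes may overlap across queues only when
// neither can reach the other.
std::vector<std::vector<bool>> computeReachability(
        const std::vector<std::vector<int>>& consumers,   // pass -> passes reading its outputs
        const std::vector<int>& topoOrder) {
    const int n = (int)consumers.size();
    std::vector<std::vector<bool>> reachable(n, std::vector<bool>(n, false));
    // Reverse topological order: a pass's consumers are processed before it.
    for (int i = n - 1; i >= 0; --i) {
        int pass = topoOrder[i];
        for (int next : consumers[pass]) {
            reachable[pass][next] = true;
            for (int k = 0; k < n; ++k)
                if (reachable[next][k]) reachable[pass][k] = true;
        }
    }
    return reachable;
}

bool independent(const std::vector<std::vector<bool>>& r, int a, int b) {
    return !r[a][b] && !r[b][a];        // safe to schedule on different queues
}
```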

🚧 Minimizing fences

Cross-queue work needs GPU fences — one queue signals, the other waits. Each fence costs ~5–15 µs of dead GPU time. Move SSAO, volumetrics, and particle sim to compute and you create six fences — up to 90 µs of idle that can erase the overlap gain. The compiler applies transitive reduction to collapse those down:

Naive — 4 fences
Graphics: [A] ──fence──→ [C]
             └──fence──→ [D]

Compute:  [B] ──fence──→ [C]
             └──fence──→ [D]
Every cross-queue edge gets its own fence
Reduced — 1 fence
Graphics: [A] ─────────→ [C] → [D]
                             
Compute:  [B] ──fence──┘

B's fence covers both C and D
(D is after C on graphics queue)
Redundant fences removed transitively
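With reachability already computed, dropping a redundant fence is one check per cross-queue edge. A sketch, assuming the compiler tracks which producers the consumer's queue already waits on before the consumer runs:

```cpp
#include <vector>

// A cross-queue edge producer -> consumer needs its own fence only if no wait
// already recorded on the consumer's queue (before the consumer runs) covers it.
// 'reachable' is the matrix from the reachability sketch above.
bool fenceIsRedundant(int producer,
                      const std::vector<int>& waitsAlreadyBeforeConsumer,
                      const std::vector<std::vector<bool>>& reachable) {
    for (int signaled : waitsAlreadyBeforeConsumer) {
        // If we already wait on the producer itself, or on a pass that depends
        // on the producer (so its completion implies the producer finished),
        // the existing fence also protects this edge.
        if (signaled == producer || reachable[producer][signaled])
            return true;
    }
    return false;   // otherwise a new fence must be emitted for this edge
}
```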

⚖️ What makes overlap good or bad

Solving fences is the easy part — the compiler handles that. The harder question is whether overlapping two specific passes actually helps:

✓ Complementary
Graphics is ROP/rasterizer-bound (shadow rasterization, geometry-dense passes) while compute runs ALU-heavy shaders (SSAO, volumetrics). Different hardware units stay busy — real parallelism, measurable frame time reduction.
✗ Competing
Both passes are bandwidth-bound or both ALU-heavy — they thrash each other's L2 cache and fight for CU time. The frame gets slower than running them sequentially. Common trap: overlapping two fullscreen post-effects.
NVIDIA uses dedicated async engines. AMD exposes more independent CUs for overlap. But on both: always profile per-GPU — the overlap that wins on one architecture can regress on another.

Try it yourself — move compute-eligible passes between queues and see how fence count and frame time change:

🔬 Interactive: Async Compute Scheduling

Move compute-eligible passes (blue) to the Compute queue to overlap work. Yellow dashed lines = fences (cross-queue sync points).

🧭 Should this pass go async?

Compute-only? no → needs rasterization
yes ↓
Independent of graphics? no → shared resource
yes ↓
Complementary overlap? no → profile first
yes ↓
Enough work between fences? no → sync eats the gain
yes ↓
ASYNC COMPUTE ✓
Good candidates: SSAO alongside ROP-bound geometry, volumetrics during shadow rasterization, particle sim during UI.

✂️ Split Barriers

Async compute hides latency by overlapping work across queues. Split barriers achieve the same effect on a single queue — by spreading one resource transition across multiple passes instead of stalling on it.

A regular barrier does a cache flush, state change, and cache invalidate in one blocking command — the GPU finishes the source pass, stalls while the transition completes, then starts the next pass. Every microsecond of that stall is wasted.

A split barrier breaks the transition into two halves and spreads them apart:

Source pass (writes texture) → BEGIN (flush caches) → Pass C (unrelated work) → Pass D (unrelated work) → END (invalidate) → destination pass (reads texture). The cache flush runs in the background while C and D execute.

The passes between begin and end are the overlap gap — they execute while the cache flush happens in the background. The compiler places these automatically: begin immediately after the source pass, end immediately before the destination.
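On D3D12 the two halves map directly to the begin/end barrier flags. A sketch of what the compiler emits — texture, cmdList, and the recordPassC/recordPassD helpers are placeholders; only the D3D12 structures and flags are real:

```cpp
#include <d3d12.h>

// A split barrier is the same Transition barrier issued twice: once with
// BEGIN_ONLY right after the producer, once with END_ONLY right before the
// consumer. The GPU may overlap the transition with the work in between.
D3D12_RESOURCE_BARRIER begin = {};
begin.Type  = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
begin.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
begin.Transition.pResource   = texture;                 // the resource being transitioned
begin.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
begin.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
begin.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;

D3D12_RESOURCE_BARRIER end = begin;
end.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;

cmdList->ResourceBarrier(1, &begin);   // right after the source pass
recordPassC(cmdList);                  // unrelated work — the overlap gap
recordPassD(cmdList);
cmdList->ResourceBarrier(1, &end);     // right before the destination pass
```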

📏 How much gap is enough?

• 0 passes — no gap; degenerates into a regular barrier with extra API cost
• 1 pass — marginal; might not cover the full flush latency
• 2+ passes — cache flush fully hidden; measurable frame time reduction
• Cross-queue — can't split across queues; use an async fence instead

Try it — drag the BEGIN marker left to widen the overlap gap and watch the stall disappear:

🔬 Interactive: Split Barriers

A barrier transitions a resource between states (e.g. render target → shader read). A regular barrier stalls the GPU at the transition point. A split barrier spreads it over a gap — the GPU overlaps other work in between. Drag the BEGIN marker left to widen the gap. Click any pass to reassign producer / consumer.

That's all the theory. Part II implements the core — barriers, culling, aliasing — in ~300 lines of C++. Part III shows how production engines deploy all of these at scale.

🎛️ Putting It All Together

You’ve now seen every piece the compiler works with — topological sorting, pass culling, barrier insertion, async compute scheduling, memory aliasing, split barriers. In a simple 5-pass pipeline these feel manageable. In a production renderer? You’re looking at 15–25 passes, 30+ resource edges, and dozens of implicit dependencies — all inferred from read() and write() calls that no human can hold in their head at once.

This is the trade-off at the heart of every render graph. Dependencies become implicit — the graph infers ordering from data flow, which means you never declare "pass A must run before pass B." That's powerful: the compiler can reorder, cull, and parallelize freely. But it also means dependencies are hidden. Miss a read() call and the graph silently reorders two passes that shouldn't overlap. Add an assert and you'll catch the symptom — but not the missing edge that caused it.

Since the frame graph is a DAG, every dependency is explicitly encoded in the structure. That means you can build tools to visualize the entire pipeline — every pass, every resource edge, every implicit ordering decision — something that’s impossible when barriers and ordering are scattered across hand-written render code.

The explorer below is a production-scale graph. Toggle each compiler feature on and off to see exactly what it contributes. Click any pass to inspect its dependencies — every edge was inferred from read() and write() calls, not hand-written.

🎛️ Interactive: Full Pipeline Explorer

The minimap shows the full render graph. Click any pass to focus — the detail view shows its neighbors, resources, and how the compiler transforms it.

