
Frame Graph: Theory

19 min read
Frame Graph · This article is part of a series
📖 Part I of IV: Theory · Build It · Beyond MVP · Production Engines

Behind every smooth frame is a complex scheduling problem: which passes can run in parallel, which buffers can reuse the same memory, and which barriers are actually necessary. Frame graphs solve it: declare what each pass reads and writes, and the graph handles the rest. This series breaks down the theory, builds a real implementation in C++, and shows how the same ideas scale to production engines like UE5’s RDG.

📖
Learn Theory
What a frame graph is, why every engine uses one, and how each piece works
🔨
Build MVP
Working C++ frame graph, from scratch to prototype in ~500 lines
🗺
Map to UE5
Every piece maps to RDG. Read the source with confidence.

The Problem
#

Month 1: 3 passes, everything's fine
Depth prepass → GBuffer → lighting. Two barriers, hand-placed. Two textures, both allocated at init. Code is clean, readable, correct.
At this scale, manual management works. You know every resource by name.
Month 6: 12 passes, cracks appear
Same renderer, now with SSAO, SSR, bloom, TAA, shadow cascades. Three things going wrong simultaneously:
Invisible dependencies: someone adds SSAO but doesn't realize GBuffer needs an updated barrier. Visual artifacts on fresh build.
Wasted memory: SSAO and bloom textures never overlap, but aliasing them means auditing every pass that might touch them. Nobody does it.
Silent reordering: two branches touch the render loop. Git merges cleanly, but the shadow pass ends up after lighting. Subtly wrong output ships unnoticed.
No single change broke it. The accumulation broke it.
Month 18: 25 passes, nobody touches it
The renderer works, but:
900 MB VRAM. Profiling shows 400 MB is aliasable, but the lifetime analysis would take a week and break the next time anyone adds a pass.
47 barrier calls. Three are redundant, two are missing, one is in the wrong queue. Nobody knows which.
2 days to add a new pass. 30 minutes for the shader, the rest to figure out where to slot it and what barriers it needs.
The renderer isn't wrong. It's fragile. Every change is a risk.

The pattern is always the same: manual resource management works at small scale and fails at compound scale. Not because engineers are sloppy. Because no human tracks 25 lifetimes and 47 transitions in their head at once. You need a system that sees the whole frame at once.

The Core Idea
#

A frame graph models an entire frame as a directed acyclic graph (DAG). Each node is a render pass. Each edge carries a resource: a texture, a buffer, or an attachment, from the pass that writes it to every pass that reads it. Here’s what an example deferred-rendering frame looks like:

[Diagram] Z-Prepass, Shadows, GBuffer, SSAO, Lighting, PostProcess, Present · edges: depth (Z-Prepass → downstream passes), normals (GBuffer → SSAO), shadow map (Shadows → Lighting), AO (SSAO → Lighting), GBuffer MRTs (GBuffer → Lighting), HDR (Lighting → PostProcess), LDR (PostProcess → Present)
nodes = render passes  ·  edges = resource dependencies  ·  forks = GPU parallelism

The GPU never sees this graph. It exists only on the CPU, long enough for the system to inspect every pass and every resource before a single GPU command is recorded. That global view is what makes automatic scheduling, memory aliasing, and barrier insertion possible. These are exactly the things that break when done by hand at scale.

The key insight is deferred execution. Instead of recording GPU commands as you encounter each pass, you first build a complete description of the frame: every pass, every resource, every dependency, and only then hand it to a compiler that can see the whole picture at once. It’s the difference between giving a builder one instruction at a time and handing them the full blueprint so they can plan the entire build before picking up a tool. The frame graph’s compile step is that planning phase.

This separation has a second benefit: the graph is a first-class data structure you can inspect, serialize, diff, and visualize. You can dump it to a log and replay it offline. You can compare this frame’s graph to last frame’s to see what changed. None of this is possible when commands are recorded inline, since the information is scattered across dozens of call sites and evaporates the moment the frame ends.

Every frame follows a three-phase lifecycle: declare, compile, execute.
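The lifecycle can be sketched as a toy class (names here are illustrative, not a real engine API): declare stores callbacks without running them, compile produces an execution plan, and execute replays that plan.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy sketch of the three-phase lifecycle. Declare stores callbacks,
// compile produces an order, execute replays it with no decisions left.
struct ToyFrameGraph {
    struct Pass {
        std::string name;
        std::function<void()> execute;  // GPU command recording would go here
    };
    std::vector<Pass> passes;  // declare-phase output
    std::vector<int>  order;   // compile-phase output (the plan)

    // Declare: store the pass, run nothing.
    void addPass(std::string name, std::function<void()> exec) {
        passes.push_back({std::move(name), std::move(exec)});
    }
    // Compile: trivially declaration order here; a real compiler topo-sorts,
    // culls, aliases, and computes barriers at this point.
    void compile() {
        order.clear();
        for (int i = 0; i < (int)passes.size(); ++i) order.push_back(i);
    }
    // Execute: walk the precomputed plan.
    void execute() {
        for (int i : order) passes[i].execute();
    }
};
```

The key property is visible even in the toy: nothing runs during declare, so the compile step always sees the complete frame before the first command is recorded.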


The Declare Step
#

The declare step is pure CPU work: you’re building a description of what this frame needs, not executing it. The key principle: separate what to do from doing it, because the compiler needs to see everything before it can optimize anything.

Registering a pass
#

A pass is a logical unit of GPU work. It might contain a single compute dispatch or hundreds of draw calls. To add one you give the graph two things:

SETUP CALLBACK
Runs now, on the CPU. Declares which resources the pass will touch and how (read, write, render target, UAV, etc.). This is where graph edges come from.
EXECUTE CALLBACK
Stored for later. Records actual GPU commands (draw calls, dispatches, copies) into a command list. Only invoked during the execute phase, potentially on a worker thread.

The setup callback is where everything that matters for the compiler happens: read, write, and create declarations build the edges and resource descriptors that drive sorting, barriers, and aliasing. The execute callback is opaque to the compiler. It just gets invoked at the right time with the right resources bound.
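A sketch of the two-callback registration (hypothetical names; `Handle` is just an integer stand-in for a resource handle): the setup callback runs immediately and records declared accesses, while the execute callback is stored untouched.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Handle = int;  // stand-in for an opaque resource handle

// Collects the declared accesses; these become graph edges.
struct PassBuilder {
    std::vector<Handle> reads, writes;
    void read(Handle h)  { reads.push_back(h); }
    void write(Handle h) { writes.push_back(h); }
};

struct DeclaredPass {
    PassBuilder accesses;           // filled during declare, drives the compiler
    std::function<void()> execute;  // opaque to the compiler, invoked later
};

DeclaredPass addPass(const std::function<void(PassBuilder&)>& setup,
                     std::function<void()> execute) {
    DeclaredPass p;
    setup(p.accesses);               // runs now, on the CPU
    p.execute = std::move(execute);  // stored for the execute phase
    return p;
}
```

The asymmetry is the point: setup is transparent data the compiler can analyze, execute is an opaque closure it only schedules.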

Virtual resources
#

When a pass creates a resource, the graph stores only a descriptor: dimensions, format, usage flags. No GPU memory is allocated. The resource is virtual: an opaque handle the compiler tracks, backed by nothing until the compile step decides where it lives in physical memory.

Handle #3 — 1920×1080 · RGBA8 · render target (description only; no GPU memory yet)

This is deliberate: the compiler needs to see every resource across the entire frame before it can decide which ones can share physical memory.

Virtual resources fall into two categories:

🔀 Transient
Lifetime: single frame, created and destroyed within the graph
Declared as: descriptor (size, format, usage flags)
GPU memory: allocated at compile, freed at frame end
Aliasable: Yes. Non-overlapping lifetimes share physical memory.
Examples: GBuffer MRTs, SSAO scratch, bloom scratch
📌 Imported (external)
Lifetime: spans multiple frames, owned by an external system
Declared as: existing GPU handle registered into the graph
GPU memory: already allocated. The graph only tracks state.
Aliasable: No. Lifetime extends beyond the frame.
Examples: backbuffer, TAA history, shadow atlas, blue noise LUT
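The two categories can be captured in one record (illustrative fields, not a real engine's layout): transient resources carry only a descriptor, imported ones carry a live GPU handle and are excluded from aliasing.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Stand-in for a texture descriptor: everything the compiler needs to
// allocate later, with no GPU memory behind it yet.
struct TextureDesc {
    uint32_t width = 0, height = 0;
    int format = 0;      // placeholder for DXGI_FORMAT / VkFormat
    uint32_t usage = 0;  // usage flag bits
};

struct VirtualResource {
    TextureDesc desc;                           // always present
    std::optional<uint64_t> importedGpuHandle;  // set only for imported resources

    bool isImported() const { return importedGpuHandle.has_value(); }
    // Rule from the text: imported lifetimes span frames, so never alias them.
    bool isAliasable() const { return !isImported(); }
};
```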

Reads, writes, and edges
#

Each read or write you declare in a setup callback forms a connection in the dependency graph:

  • Read: connects this pass to the last writer of a resource, specifying how the resource will be accessed.
  • Write: advances the resource to a new version, making future reads depend on this pass instead of earlier writers, and defines the intended access.
  • Create: introduces a new virtual resource, tracked by the graph but not yet backed by memory.

The mechanism behind these edges is versioning: every time a pass writes a resource, the version number increments. Readers attach to whatever version existed when they were declared. Multiple passes can read the same version without conflict, but only a write creates a new version and a new dependency. Here’s how that plays out across a real frame:

Resource versioning: HDR target through the frame

  • v1 WRITE · Lighting: renders lit color into HDR target
  • v1 read · Bloom: samples bright pixels (still v1)
  • v1 read · Reflections: samples for SSR (still v1)
  • v1 read · Fog: reads scene color for aerial blending (still v1)
  • v2 WRITE · Composite: overwrites with final blended result (bumps to v2)
  • v2 read · Tonemap: maps HDR → SDR for display (reads v2, not v1)
Reads never bump the version: three passes read v1 without conflict. Only a write creates v2. Tonemap depends on Composite (the v2 writer), with no edge to Lighting or any v1 reader.

These versioned edges are the raw material the compiler works with. Every step that follows (sorting, culling, barrier insertion) operates on this edge set.
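The versioning mechanism fits in a few lines. This is a minimal sketch (hypothetical structure, passes identified by index): writes bump the version and become the new dependency target, reads attach an edge to the current writer.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Tracks one resource's version and the edges its accesses produce.
struct VersionTracker {
    int version = 0;
    int lastWriter = -1;                    // pass index of the current writer
    std::vector<std::pair<int,int>> edges;  // (writer pass, reader pass)

    void write(int pass) {
        ++version;          // only writes create a new version
        lastWriter = pass;  // future reads depend on this pass
    }
    void read(int pass) {
        if (lastWriter >= 0)
            edges.push_back({lastWriter, pass});  // edge to current version's writer
    }
};
```

Replaying the HDR-target timeline above: three reads of v1 add edges to Lighting without bumping the version; Composite's write creates v2, so Tonemap's read depends on Composite alone.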


The Compile Step
#

Once every pass has declared its resources and dependencies, the compiler takes over. It receives the raw DAG (passes, virtual resources, read/write edges) and transforms it into a concrete execution plan: a sorted pass order, aliased memory layout, and a complete barrier schedule. This entire analysis happens on the CPU, over plain integers and small arrays.

📥 In: declared passes + virtual resources + read/write edges

  1. Sort passes into dependency order
  2. Cull passes whose outputs are never read
  3. Scan lifetimes: record each transient resource's first and last use
  4. Alias: assign non-overlapping resources to shared memory slots
  5. Compute barriers: insert transitions at every resource state change

📤 Out: ordered passes · aliased memory · barrier list · physical bindings

Sorting
#

Before the GPU can execute anything, the compiler needs to turn the DAG into an ordered schedule. The rule is simple: no pass runs before the passes it depends on. This is called a topological sort.

The input is the raw edge set from the declare step: every read/write dependency between passes. The output is a flat list of passes in an order that respects all of them. If pass A writes a resource that pass B reads, A will always appear before B. If two passes share no dependencies, either order is valid, and the compiler is free to pick whichever is cheaper. This isn’t a fixed ordering you design upfront. It’s derived automatically from the declared edges, so adding or removing a pass never requires manual re-sequencing.

The algorithm most compilers use is Kahn’s algorithm. Think of it like a to-do list where you can only start a task once all its prerequisites are done:

  1. Count in-edges. For every pass, count how many predecessors feed into it. Any pass with a count of zero is ready: nothing blocks it.
  2. Pop a ready pass. Pick any zero-count pass and append it to the sorted output.
  3. Decrement successors. Subtract one from every pass that depended on it. New zeros join the ready queue.
  4. Loop until empty. Repeat until the queue drains. Passes left with non-zero counts mean a cycle: the graph is broken.
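The four steps above translate directly into code. A minimal sketch over pass indices 0..n-1, where `edges[i]` lists the passes that depend on pass i:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Kahn's algorithm. Returns passes in dependency order, or an empty
// vector if the declared graph contains a cycle.
std::vector<int> topoSort(int n, const std::vector<std::vector<int>>& edges) {
    std::vector<int> inDegree(n, 0);          // step 1: count in-edges
    for (const auto& outs : edges)
        for (int to : outs) ++inDegree[to];

    std::queue<int> ready;                    // passes with nothing blocking them
    for (int i = 0; i < n; ++i)
        if (inDegree[i] == 0) ready.push(i);

    std::vector<int> order;
    while (!ready.empty()) {                  // steps 2-4: pop, decrement, loop
        int p = ready.front(); ready.pop();
        order.push_back(p);
        for (int to : edges[p])               // unblock successors
            if (--inDegree[to] == 0) ready.push(to);
    }
    // Leftover non-zero counts mean a cycle: signal a broken graph.
    return (int)order.size() == n ? order : std::vector<int>{};
}
```

Swapping the `std::queue` for a priority queue keyed on GPU state is exactly where the state-grouping heuristic described below would hook in.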

Sorting bonus: fewer state switches. Kahn’s algorithm often has several passes ready at the same time, which gives the compiler freedom to choose among them. A sort-time heuristic can use that freedom to group passes that share GPU state, reducing expensive context rolls. A topological sort doesn’t just guarantee correctness. It creates scheduling slack the compiler can exploit for performance.

Culling
#

Once the sort gives us a valid execution order, the compiler can ask a powerful question: does every pass actually contribute to the final image?

In a hand-built renderer, you’d need to manually toggle passes with feature flags or #ifdef blocks. Miss one, and the GPU silently burns cycles on work nobody sees. The frame graph compiler does this automatically. It walks the DAG backward from the final output (usually the swapchain image) and marks every pass that contributes to it, directly or indirectly. Any pass that isn’t on a path to the output gets removed.

This is the same idea as dead-code elimination in a regular compiler: if a function’s return value is never used, the compiler strips it out. Here, if a render pass writes to a texture that no downstream pass ever reads, the entire pass (and its resource allocations) disappear.

Why this matters in practice:

  • Feature toggling is free. Disable bloom by not reading its output, and the bloom pass plus its textures vanish automatically. No if (bloomEnabled) checks scattered through your code.
  • Debug passes cost nothing in release. A visualization pass that only feeds a debug overlay gets culled the moment the overlay is turned off.
  • Artists and designers can experiment with pass configurations without worrying about leftover GPU cost from unused passes.

The algorithm is simple: start from every output the frame needs (typically just the final composite), walk backward along edges, and flag each pass you visit as “alive.” Anything not flagged is dead. Skip it entirely.
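That backward walk is a plain graph reachability pass. A sketch, with `producers[i]` listing the passes whose outputs pass i reads:

```cpp
#include <cassert>
#include <vector>

// Dead-pass elimination: flag everything reachable backward from the
// frame's output; anything unflagged is culled.
std::vector<bool> markAlive(int n,
                            const std::vector<std::vector<int>>& producers,
                            int outputPass) {
    std::vector<bool> alive(n, false);
    std::vector<int> stack{outputPass};     // start from the final output
    while (!stack.empty()) {
        int p = stack.back(); stack.pop_back();
        if (alive[p]) continue;             // already visited
        alive[p] = true;
        for (int src : producers[p])
            stack.push_back(src);           // walk backward along edges
    }
    return alive;
}
```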


Allocation and aliasing
#

The sorted order tells the compiler exactly when each resource is first written and last read: its lifetime. Two resources whose lifetimes don’t overlap can share the same physical memory, even if they’re completely different formats or sizes. The GPU allocates one large heap and places multiple resources at different offsets within it.

Without aliasing, every transient texture gets its own allocation for the entire frame, even if it’s only alive for 2–3 passes. With aliasing, a GBuffer that’s done by pass 3 and a bloom buffer that starts at pass 4 can sit in the same memory. Real-world deferred pipelines commonly see 40–50% transient VRAM reduction once aliasing is enabled.

Frostbite / Battlefield 1 — GDC 2017 — reported exactly this savings from graph-driven transient aliasing in production

The allocator works in two passes: first, walk the sorted pass list and record each transient resource’s first write and last read. Then scan resources in order of first-use. For each one, check if an existing heap block is free (its previous occupant has finished). If so, reuse it. If not, allocate a new block.

Correctness constraints. Aliasing introduces four hard requirements. Violating any of them causes GPU corruption or driver-level undefined behavior:

  1. First access must initialize the resource. The physical block still holds the previous occupant’s texels. For resources with render-target or depth-stencil flags, D3D12 requires one of three operations before any other access: a Clear, a DiscardResource, or a full-subresource Copy. Vulkan equivalently requires load-op CLEAR or DONT_CARE (which acts as a discard). Without this the GPU reads undefined contents from the evicted resource.

D3D12 — CreatePlacedResource docs · Vulkan — memory aliasing spec

  2. Only transient (single-frame) resources qualify. Imported resources that persist across frames (swapchain images, temporal history buffers) must keep their data intact, so they can’t share a heap slot. The allocator enforces this by checking whether a resource is imported and, if so, skipping it during aliasing.

  3. Placed-resource alignment is non-negotiable. D3D12 requires 64 KB (D3D12_DEFAULT_RESOURCE_PLACEMENT_ALIGNMENT) for most textures, or 4 MB for MSAA. Vulkan surfaces the requirement via VkMemoryRequirements::alignment. Any allocator must round up to at least 64 KB to satisfy the common-case constraint.

  4. An aliasing barrier must separate occupants. Before the GPU begins using the new resource, it must make the old occupant’s caches and metadata irrelevant to that shared memory block. D3D12 exposes D3D12_RESOURCE_BARRIER_TYPE_ALIASING. Omitting this synchronization causes timing-dependent corruption that may vary by driver or GPU architecture. The frame graph’s barrier compiler emits these automatically whenever a heap block changes owner between passes.

Deep dive: D3D12 — Memory Aliasing and Data Inheritance, “Aliasing” · D3D12 — D3D12_RESOURCE_ALIASING_BARRIER · Vulkan — VkMemoryBarrier / VkImageMemoryBarrier

Barriers
#

A GPU resource can’t be a render target and a shader input at the same time. The hardware needs to flush caches, change memory layout, and switch access modes between those uses. That transition is a barrier. Miss one and you get rendering corruption or a GPU crash. Add an unnecessary one and the GPU stalls waiting for nothing.

Barriers follow the same rule as everything else in a frame graph: compile analyzes and decides, execute submits and runs. The compile stage is static analysis, the execute stage is command playback.

During compile, the compiler walks every pass in topological order and builds a per-resource usage timeline: which passes touch which resource, and in what state (color attachment, shader read, transfer destination, etc.). For each resource it tracks the current state, starting at Undefined. Whenever a pass needs the resource in a different state, the compiler records a transition (say, ColorAttachment → ShaderRead when GBuffer’s output becomes Lighting’s input) and updates the tracked state. No GPU work happens. This is purely analysis over the declared reads and writes.

A concrete example: suppose GBuffer writes Albedo as a color attachment, then Lighting and PostProcess both read it as a shader resource. The compiler emits one barrier after GBuffer (ColorAttachment → ShaderRead) and nothing between Lighting and PostProcess, since consecutive reads in the same state don’t need a transition. A production compiler goes further: it merges multiple transitions into a single barrier call per pass, removes redundant transitions, and eliminates barriers for resources that are about to be aliased anyway.
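The core of that analysis is a per-resource state walk. A sketch with a deliberately tiny state enum (real APIs have dozens of states; the names here are illustrative):

```cpp
#include <cassert>
#include <vector>

enum class State { Undefined, ColorAttachment, ShaderRead };

// A transition to submit before the given pass runs.
struct Transition { int beforePass; State from, to; };

// Walk one resource's usage timeline in sorted pass order; emit a
// transition only when the required state differs from the tracked one.
std::vector<Transition> computeBarriers(const std::vector<State>& usagePerPass) {
    std::vector<Transition> out;
    State current = State::Undefined;       // every resource starts undefined
    for (int pass = 0; pass < (int)usagePerPass.size(); ++pass) {
        if (usagePerPass[pass] != current) {
            out.push_back({pass, current, usagePerPass[pass]});
            current = usagePerPass[pass];   // track the new state
        }
    }
    return out;
}
```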

The result is a compiled plan where each pass carries a list of pre-barriers alongside its execute callback. At execution time the loop is trivial: for each pass, submit its precomputed barriers (vkCmdPipelineBarrier / ResourceBarrier), begin the render pass, call the execute callback, end the render pass. No graph walking, no state comparison, no decisions. The GPU receives exactly what was precomputed.


The Execute Step
#

The plan is ready. Now the GPU gets involved. Every decision has already been made during compile: pass order, memory layout, barriers, physical resource bindings. Execute just walks the plan.

This is deliberate. The entire point of the declare/compile split is to front-load all the analysis so that execution becomes a trivial loop. No graph traversal, no state comparisons, no dependency checks. The system iterates through the compiled pass list, submits the precomputed barriers for each pass (vkCmdPipelineBarrier in Vulkan, ResourceBarrier in D3D12), begins the render pass, invokes the execute lambda, and ends the render pass. That pattern repeats until every pass has been recorded into the command buffer.

▶ EXECUTE: recording GPU commands
FOR EACH PASS
  • submit precomputed barriers
  • begin render pass
  • call execute() lambda: draw calls, dispatches, copies
  • end render pass
The only phase that touches the GPU API (resources already bound)

Each execute lambda sees a fully resolved environment: barriers already computed and stored in the plan, memory already allocated, resources ready to bind. The lambda just records draw calls, dispatches, and copies. All the intelligence lives in the compile step.

Parallel command recording
#

The compiled plan doesn’t just decide what runs. It reveals what can run at the same time. Because each lambda only touches its own declared resources and all barriers are precomputed, the engine knows exactly which passes are independent and can record them on separate CPU threads simultaneously.

Parallel recording: conceptual flow
The compiled plan identifies groups of passes with no dependencies between them (they appear at the same depth in the sorted order). Each independent pass is dispatched to a worker thread, which records GPU commands into its own secondary command buffer (Vulkan) or command list (D3D12). Once all threads finish, the engine merges the recorded buffers into the primary command buffer in the correct order and submits to the GPU.

The scalability gain comes from the DAG itself: passes at the same depth in the topological order have no edges between them, so recording them in parallel requires no additional synchronization. The more independent passes a frame has, the more CPU cores stay busy, and modern frames have plenty of independence (shadow cascades, GBuffer, SSAO, and bloom often share a depth level).
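Those depth levels fall out of the producer edges in one linear scan. A sketch, assuming the pass array is already in topological order and `producers[i]` lists the passes feeding pass i:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Depth of a pass = 1 + max depth of its producers (0 if it has none).
// Passes sharing a depth have no edges between them, so their command
// recording can be dispatched to worker threads with no extra sync.
std::vector<int> depthLevels(const std::vector<std::vector<int>>& producers) {
    std::vector<int> depth(producers.size(), 0);
    for (size_t i = 0; i < producers.size(); ++i)     // topological order
        for (int src : producers[i])
            depth[i] = std::max(depth[i], depth[src] + 1);
    return depth;
}
```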

Cleanup and reset
#

After every pass has been recorded, cleanup of the graph’s own transient state is trivial. The single-frame lifetime rule applies to graph-owned temporaries, so the system just resets the transient memory pool in one shot (every GBuffer, scratch texture, and temporary buffer vanishes together). Imported resources like the swapchain, TAA history, or shadow atlas aren’t reset or destroyed. The graph still tracks their reads, writes, dependencies, and barriers while compiling the frame; they simply remain owned by external systems and persist across frames. The graph object itself clears its pass list and resource table, leaving it empty and ready for the next frame’s declare phase to start fresh. This reset-and-rebuild cycle is what lets engines add or remove passes freely without any teardown logic.

Deep dive: UE5 RDG — “Transient Resources” · UE5 RDG — “External Resources”


Rebuild Strategies
#

How often should the graph recompile? Three approaches, each a valid tradeoff:

🔄 Dynamic
Rebuild every frame.
Cost: microseconds
Flexibility: total. Passes can appear, disappear, or change every frame
Hybrid
Cache compiled result, invalidate on change.
Cost: near-zero on hit
Flexibility: total, but requires dirty-tracking to know when to invalidate the cache
🔒 Static
Compile once at init, replay forever.
Cost: zero
Flexibility: none. The pipeline is locked at startup
Rare in practice

Dynamic is the simplest approach and the most common starting point. The compile cost is low (sorting, culling, aliasing, and barrier computation are all CPU-side integer work over small arrays), but it isn’t zero. It scales with the number of passes and resources, and on CPU-constrained platforms (consoles, mobile) or graphs with hundreds of passes, the per-frame cost can become noticeable.

Hybrid exists precisely because of that cost. When the graph topology is mostly stable frame-to-frame (same passes, same connections), there’s no reason to recompute the same plan 60 times per second. A hybrid approach caches the compiled result and only invalidates when the declared graph actually changes. Some engines detect that automatically (hashing the pass + resource set, dirty bits, topology fingerprints). Others make it an explicit API decision: the caller decides when to recompile because it already knows which events can change the graph. The tradeoff is the same either way: you must guarantee a stale plan is never replayed against a changed graph.
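One way to sketch the automatic detection is a topology fingerprint: hash the declared edge set each frame and recompile only when the hash changes. This uses FNV-1a purely as an illustration; a production engine would also fold in resource descriptors and pass parameters.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// FNV-1a over the declared edge set. Identical topology yields an
// identical fingerprint, so the cached compiled plan can be replayed.
uint64_t fingerprint(const std::vector<std::pair<int,int>>& edges) {
    uint64_t h = 1469598103934665603ull;                  // FNV offset basis
    auto mix = [&](uint64_t v) { h ^= v; h *= 1099511628211ull; };
    for (const auto& [from, to] : edges) {
        mix((uint64_t)from);
        mix((uint64_t)to);
    }
    return h;
}
```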

Deep dive: Khronos Vulkan Guide — Common Pitfalls, “Recording Command Buffers” (why fresh per-frame recording is often acceptable in practice)

Static compiles once at init and replays the same plan forever. It’s rarely useful because the whole point of a frame graph is flexibility: feature toggles, dynamic quality scaling, debug overlays. A locked pipeline can’t adapt.


The Payoff
#

❌ Without Graph → ✅ With Graph

  • Memory aliasing: opt-in, fragile, rarely done → automatic; the compiler sees all lifetimes (~50% transient VRAM savings: Frostbite, BF1, GDC 2017)
  • Lifetimes: manual create/destroy, leaked or over-retained → scoped to first..last use; zero waste
  • Barriers: manual, per-pass → precomputed at compile from declared reads/writes
  • Pass reordering: breaks silently → safe; the compiler respects dependencies
  • Pass culling: manual ifdef / flag checks → automatic; unused outputs = dead pass
  • Context rolls: hard-coded pass order causes unnecessary state switches → sort heuristic groups compatible passes for fewer switches (gain varies by topology)

Advanced (covered in Part III):

  • Async compute: manual queue sync → compiler schedules independent passes across queues
  • Split barriers: manual begin/end placement → compiler overlaps flushes with unrelated work

What’s next
#

That’s the full theory: sorting, culling, aliasing, and barriers, everything a frame graph compiler does. Part II: Build It turns every concept from this article into running C++, in three iterations from a blank file to a working FrameGraph class with automatic barriers and memory aliasing.

