
Frame Graph: Beyond MVP

·6 mins

Frame Graph — This article is part of a series.
📖 Part III of IV: Theory · Build It · Beyond MVP · Production Engines

Part I covered the core (sorting, culling, barriers, aliasing) and Part II built it in C++. The same DAG enables the compiler to go further. It can schedule independent work across GPU queues and split barrier transitions to hide cache-flush latency.


Async Compute

Barriers optimize work on a single GPU queue. But modern GPUs expose at least two: a graphics queue and a compute queue. If two passes have no dependency path between them in the DAG, the compiler can schedule them on different queues simultaneously.

Finding parallelism

The compiler needs to answer one question for every pair of passes: can these run at the same time? Two passes can overlap only if neither depends on the other, directly or indirectly. A pass that writes the GBuffer can’t overlap with lighting (which reads it), but it can overlap with SSAO if they share no resources.

The algorithm is called reachability analysis: for each pass, the compiler figures out every other pass it can eventually reach by following edges forward through the DAG. If pass A can reach pass B (or B can reach A), they’re dependent. If neither can reach the other, they’re independent and safe to run on separate queues.
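Reachability falls out of one backward sweep over a topological order. Below is a minimal sketch of that idea — names like `PassDag` and `canOverlap` are illustrative, not from the Part II code — using one bitmask per pass and assuming at most 64 passes:

```cpp
#include <cstdint>
#include <vector>

// One reachability bitmask per pass: bit b of reach[a] is set if pass a
// can reach pass b by following dependency edges forward through the DAG.
struct PassDag {
    std::vector<std::vector<int>> edges;  // edges[a] = passes that depend on a
    std::vector<uint64_t> reach;          // assumes <= 64 passes for brevity

    explicit PassDag(int n) : edges(n), reach(n, 0) {}
    void addEdge(int from, int to) { edges[from].push_back(to); }

    // Walk the topological order backwards; each pass ORs in its
    // successors' masks, so transitive reachability falls out for free.
    void computeReachability(const std::vector<int>& topoOrder) {
        for (auto it = topoOrder.rbegin(); it != topoOrder.rend(); ++it)
            for (int succ : edges[*it])
                reach[*it] |= (uint64_t(1) << succ) | reach[succ];
    }

    // Independent (safe to overlap on separate queues) iff neither
    // pass can reach the other.
    bool canOverlap(int a, int b) const {
        return !((reach[a] >> b) & 1) && !((reach[b] >> a) & 1);
    }
};
```

With GBuffer (0) and SSAO (2) both feeding Lighting (1), `canOverlap(0, 2)` is true while `canOverlap(0, 1)` is false — exactly the GBuffer/SSAO/lighting example above.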

Minimizing fences

Cross-queue work needs GPU fences: one queue signals, the other waits. Each fence adds dead GPU time. Async workloads under ~0.2 ms are unlikely to show any benefit, because fence resolution overhead alone eats the gain, and AMD’s RDNA Performance Guide advises minimizing queue synchronization because “each fence has a CPU and GPU cost” (GPUOpen). Offload three passes to async compute and you might need three separate fences, one per synchronization point; the accumulated stall time from waiting on all of them can negate the overlap benefit entirely. The compiler applies transitive reduction to collapse those down:

Naive: 4 fences
Graphics: [A] ──fence──→ [C]
             └──fence──→ [D]

Compute:  [B] ──fence──→ [C]
             └──fence──→ [D]
Every cross-queue edge gets its own fence
Reduced: 1 fence
Graphics: [A] ─────────→ [C] → [D]
                             
Compute:  [B] ──fence──↗

B's fence covers both C and D
(D is after C on graphics queue)
Redundant fences removed transitively
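One way to sketch that reduction for a single graphics/compute queue pair: model each fence as a (signal position, wait position) pair in queue-submission order. A fence is redundant when another fence signals no earlier and waits no later, since same-queue ordering then carries its guarantee. The `Fence` struct and `reduceFences` helper are illustrative, not part of the Part II implementation:

```cpp
#include <algorithm>
#include <vector>

// Signal after pass index `sig` on the source queue, wait before pass
// index `wait` on the destination queue (indices in submission order).
struct Fence { int sig; int wait; };

// Fence g is redundant if some fence f has f.sig >= g.sig and
// f.wait <= g.wait. Sorting by wait ascending (sig descending on ties)
// lets a single sweep keep only the fences nothing else covers.
std::vector<Fence> reduceFences(std::vector<Fence> fences) {
    std::sort(fences.begin(), fences.end(), [](const Fence& a, const Fence& b) {
        return a.wait != b.wait ? a.wait < b.wait : a.sig > b.sig;
    });
    std::vector<Fence> kept;
    int latestSig = -1;  // latest signal point already covered by a kept fence
    for (const Fence& f : fences)
        if (f.sig > latestSig) { kept.push_back(f); latestSig = f.sig; }
    return kept;
}
```

For the diagram above, B signals at position 0 on the compute queue while C and D wait at positions 1 and 2 on graphics; `reduceFences({{0, 1}, {0, 2}})` keeps only the B→C fence, because D already runs after C on the graphics queue.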

What makes overlap good or bad

Solving fences is the easy part. The compiler handles that. The harder question is whether overlapping two specific passes actually helps:

✅ Complementary
Graphics is ROP/rasterizer-bound (shadow rasterization, geometry-dense passes) while compute runs ALU-heavy shaders (SSAO, volumetrics). Different hardware units stay busy: real parallelism, measurable frame time reduction.
❌ Competing
Both passes are bandwidth-bound or both ALU-heavy: they thrash each other's L2 cache and fight for CU time. The frame gets slower than running them sequentially. Common trap: overlapping two fullscreen post-effects.

Should this pass go async?

Is it a compute shader?                  ❌ no → requires raster pipeline
Zero resource contention with graphics?  ❌ no → data hazard with graphics
Complementary resource usage?            ❌ no → same HW units: no overlap
Enough work between fences?              ❌ no → sync cost exceeds gain
All four yes → ASYNC COMPUTE ✅
Good candidates: SSAO alongside ROP-bound geometry, volumetrics during shadow rasterization, particle sim during UI.
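The four checks translate directly into a scheduling predicate. A minimal sketch — the struct fields and the 0.2 ms floor restate the rules of thumb above; none of these names come from a real engine:

```cpp
// Per-pass facts the scheduler would gather; all fields are illustrative.
struct AsyncCandidate {
    bool  isComputeShader;             // raster passes can't change queues
    bool  sharesResourcesWithGraphics; // any shared resource = data hazard
    bool  complementaryUnits;          // e.g. ALU-heavy vs a ROP-bound frame
    float estimatedMs;                 // expected GPU time of the workload
};

bool shouldGoAsync(const AsyncCandidate& p) {
    if (!p.isComputeShader)            return false; // requires raster pipeline
    if (p.sharesResourcesWithGraphics) return false; // data hazard with graphics
    if (!p.complementaryUnits)         return false; // same HW units: no overlap
    if (p.estimatedMs < 0.2f)          return false; // sync cost exceeds gain
    return true;                                     // async compute ✅
}
```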

Try it yourself: move compute-eligible passes between queues and see how fence count and frame time change:

[Interactive: Async Compute Scheduling — move compute-eligible passes to the Compute queue to overlap work; dashed lines show cross-queue sync points.]


Split Barriers

Async compute hides latency by overlapping work across queues. Split barriers achieve the same effect on a single queue, by spreading one resource transition across multiple passes instead of stalling on it.

A regular barrier does a cache flush, state change, and cache invalidate in one blocking command: the GPU finishes the source pass, stalls while the transition completes, then starts the next pass. Every microsecond of that stall is wasted.

A split barrier breaks the transition into two halves and spreads them apart:

[Source pass] ──BEGIN──→ [Pass C] ──→ [Pass D] ──END──→ [Dest pass]
writes texture   flush    unrelated    unrelated  invalidate  reads texture
                 ↑ cache flush runs in background while these execute ↑

The passes between begin and end are the overlap gap, executing while the cache flush happens in the background. The compiler places these automatically: begin immediately after the source pass, end immediately before the destination.

D3D12 MiniEngine — Microsoft DirectX Samples — demonstrates split barrier patterns with D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY / END_ONLY
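The placement rule itself is small enough to sketch. Assuming passes are indexed in final queue order, a hypothetical `placeSplitBarrier` helper (in D3D12 the two halves would be recorded with the BEGIN_ONLY / END_ONLY flags mentioned above) might look like:

```cpp
// Where the compiler records the two halves of a split barrier for a
// resource written by pass `writer` and next read by pass `reader`.
struct SplitBarrier {
    int  beginAfter;  // BEGIN (cache flush) goes right after the writer
    int  endBefore;   // END (invalidate) goes right before the reader
    int  gap;         // unrelated passes available to hide flush latency
    bool worthIt;     // gap of 0 degenerates into a regular barrier
};

SplitBarrier placeSplitBarrier(int writer, int reader) {
    SplitBarrier b;
    b.beginAfter = writer;
    b.endBefore  = reader;
    b.gap        = reader - writer - 1;
    b.worthIt    = b.gap >= 1;  // 2+ passes typically hides the flush fully
    return b;
}
```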

How much gap is enough?

0 passes    — no gap: degenerates into a regular barrier with extra API cost
1 pass      — marginal: might not cover the full flush latency
2+ passes   — cache flush fully hidden: measurable frame time reduction
cross-queue — can't split across queues: use an async fence instead

Putting It All Together

You’ve now seen every piece the compiler works with: topological sorting, pass culling, barrier computation, async compute scheduling, memory aliasing, split barriers. In a simple 5-pass pipeline these feel manageable. In a production renderer? You’re looking at 15–25 passes, 30+ resource edges, and dozens of implicit dependencies, all inferred from read() and write() calls that no human can hold in their head at once.

This is the trade-off at the heart of every render graph. Dependencies become implicit: the graph infers ordering from data flow, which means you never declare "pass A must run before pass B." That's powerful: the compiler can reorder, cull, and parallelize freely. But it also means dependencies are hidden. Miss a read() call and the graph silently reorders two passes that shouldn't overlap. Add an assert and you'll catch the symptom, but not the missing edge that caused it.

Since the frame graph is a DAG, every dependency is explicitly encoded in the structure. That means you can build tools to visualize the entire pipeline: every pass, every resource edge, every implicit ordering decision, something that’s impossible when barriers and ordering are scattered across hand-written render code.

The explorer below is a production-scale graph. Toggle each compiler feature on and off to see exactly what it contributes. Click any pass to inspect its dependencies: every edge was inferred from read() and write() calls, not hand-written.

[Interactive: Full Pipeline Explorer — click any pass in the minimap to see its neighbors, resources, and compiler transforms.]

What’s next

Async compute and split barriers are compiler features: they plug into the same DAG we built in Part II. But how do production engines actually ship all of this at scale? Part IV: Production Engines examines UE5’s RDG and Frostbite’s FrameGraph side by side, covering parallel command recording, legacy migration, and the engineering trade-offs that only matter at 700+ passes per frame.

