Part I covered the core (sorting, culling, barriers, aliasing) and Part II built it in C++. The same DAG enables the compiler to go further. It can schedule independent work across GPU queues and split barrier transitions to hide cache-flush latency.
Async Compute
Barriers optimize work within a single GPU queue. But modern GPUs expose at least two: a graphics queue and a compute queue. If two passes have no dependency path between them in the DAG, the compiler can schedule them on different queues so they run simultaneously.
Finding parallelism
The compiler needs to answer one question for every pair of passes: can these run at the same time? Two passes can overlap only if neither depends on the other, directly or indirectly. A pass that writes the GBuffer can’t overlap with lighting (which reads it), but it can overlap with SSAO if they share no resources.
The algorithm is called reachability analysis: for each pass, the compiler figures out every other pass it can eventually reach by following edges forward through the DAG. If pass A can reach pass B (or B can reach A), they’re dependent. If neither can reach the other, they’re independent and safe to run on separate queues.
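Assuming passes are stored in topological order with an adjacency list of forward edges (all names here are illustrative, not the series' actual API), the reachability sweep can be sketched like this:

```cpp
#include <vector>

// Illustrative sketch: forward reachability over a pass DAG.
// Assumes pass indices are already in topological order, so every
// edge points from a lower index to a higher one.
using AdjList = std::vector<std::vector<int>>;

// reach[a][b] == true iff pass b is reachable from pass a.
std::vector<std::vector<bool>> computeReachability(const AdjList& edges) {
    const int n = static_cast<int>(edges.size());
    std::vector<std::vector<bool>> reach(n, std::vector<bool>(n, false));
    // Walk passes in reverse topological order so each successor's
    // reach set is complete before we inherit it.
    for (int a = n - 1; a >= 0; --a) {
        for (int b : edges[a]) {
            reach[a][b] = true;
            for (int c = 0; c < n; ++c)
                if (reach[b][c]) reach[a][c] = true;  // inherit b's reach set
        }
    }
    return reach;
}

// Two passes may run on separate queues iff neither reaches the other.
bool canOverlap(const std::vector<std::vector<bool>>& reach, int a, int b) {
    return !reach[a][b] && !reach[b][a];
}
```

For the example above: with GBuffer → Lighting, the lighting pass is reachable from the GBuffer pass, so they can never overlap; an SSAO pass with no path to or from GBuffer passes the `canOverlap` check.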
Minimizing fences
Cross-queue work needs GPU fences: one queue signals, the other waits. Each fence adds dead GPU time. Async workloads under ~0.2 ms are unlikely to show any benefit, because fence-resolution overhead alone eats the gain, and AMD’s RDNA Performance Guide advises minimizing queue synchronization because “each fence has a CPU and GPU cost” (GPUOpen). Offload three passes to async compute and you might need three separate fences, one per synchronization point; the accumulated stall time from waiting on all of them can negate the overlap benefit entirely. The compiler applies transitive reduction to collapse those down:
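A minimal sketch of transitive reduction, assuming a precomputed reachability matrix (`reach[a][b]` true iff pass b is reachable from pass a) and illustrative names throughout: an edge a→b is dropped when another successor of a already reaches b, because the remaining edges then imply the ordering and the fence for a→b is redundant.

```cpp
#include <vector>

using AdjList = std::vector<std::vector<int>>;

// Keep edge a->b only if no other successor of a already reaches b.
// Each surviving cross-queue edge is a fence the compiler must emit,
// so fewer edges means fewer synchronization points.
AdjList transitiveReduction(const AdjList& edges,
                            const std::vector<std::vector<bool>>& reach) {
    const int n = static_cast<int>(edges.size());
    AdjList reduced(n);
    for (int a = 0; a < n; ++a) {
        for (int b : edges[a]) {
            bool redundant = false;
            for (int c : edges[a]) {
                if (c != b && reach[c][b]) {  // another path a->c->...->b exists
                    redundant = true;
                    break;
                }
            }
            if (!redundant) reduced[a].push_back(b);  // only path to b: keep it
        }
    }
    return reduced;
}
```

With edges A→B, B→C, and A→C, the direct A→C edge is implied by the chain through B, so its fence disappears.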
What makes overlap good or bad
Solving fences is the easy part. The compiler handles that. The harder question is whether overlapping two specific passes actually helps:
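One common way to frame that decision can be sketched as a heuristic over per-pass profiling data. This is a hypothetical scoring function, not the compiler's actual logic; the struct fields and threshold constant are illustrative, with the ~0.2 ms figure taken from the fence-overhead discussion above:

```cpp
// Hypothetical per-pass profile, assuming the engine measures GPU time
// and can classify each pass's bottleneck.
struct PassProfile {
    double durationMs;      // measured GPU time for this pass
    bool   computeEligible; // can run as a compute dispatch
    bool   bandwidthBound;  // limited by memory traffic rather than ALU
};

// Workloads under ~0.2 ms rarely pay for their fences (see above).
constexpr double kMinAsyncDurationMs = 0.2;

// Overlap tends to pay off when the async pass is long enough to
// amortize fence overhead AND the two passes stress different hardware
// units, e.g. a bandwidth-bound graphics pass paired with an ALU-bound
// compute pass, so each fills units the other leaves idle.
bool worthOverlapping(const PassProfile& graphics, const PassProfile& async) {
    if (!async.computeEligible) return false;
    if (async.durationMs < kMinAsyncDurationMs) return false;  // fences eat the gain
    return graphics.bandwidthBound != async.bandwidthBound;    // complementary bottlenecks
}
```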
Should this pass go async?
Try it yourself: move compute-eligible passes between queues and see how fence count and frame time change:
Split Barriers
Async compute hides latency by overlapping work across queues. Split barriers achieve the same effect on a single queue, by spreading one resource transition across multiple passes instead of stalling on it.
A regular barrier does a cache flush, state change, and cache invalidate in one blocking command: the GPU finishes the source pass, stalls while the transition completes, then starts the next pass. Every microsecond of that stall is wasted.
A split barrier breaks the transition into two halves and spreads them apart:
The passes between begin and end are the overlap gap, executing while the cache flush happens in the background. The compiler places these automatically: begin immediately after the source pass, end immediately before the destination.
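That placement rule is simple enough to sketch directly. The names below are illustrative; the point is that the overlap gap falls out of the pass indices, and a gap of zero means the split degenerates into a regular blocking barrier:

```cpp
// Illustrative sketch of the compiler's placement rule for a transition
// on a resource written by pass `src` and read by pass `dst` (src < dst
// in the sorted execution order).
struct SplitBarrier {
    int beginAfterPass;  // record the begin half right after this pass
    int endBeforePass;   // record the end half right before this pass
    int overlapGap;      // passes that execute while the flush completes
};

SplitBarrier placeSplitBarrier(int src, int dst) {
    // Begin immediately after the source, end immediately before the
    // destination; everything in between hides the transition latency.
    return SplitBarrier{src, dst, dst - src - 1};
}

// Adjacent passes leave no gap, so a split buys nothing over a
// regular barrier there.
bool worthSplitting(const SplitBarrier& b) { return b.overlapGap > 0; }
```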
In D3D12, the two halves are expressed with the D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY / END_ONLY flags.

How much gap is enough?
Putting It All Together
You’ve now seen every piece the compiler works with: topological sorting, pass culling, barrier computation, async compute scheduling, memory aliasing, split barriers. In a simple 5-pass pipeline these feel manageable. In a production renderer? You’re looking at 15–25 passes, 30+ resource edges, and dozens of implicit dependencies, all inferred from read() and write() calls that no human can hold in their head at once.
Miss a single read() call and the graph silently reorders two passes that shouldn't overlap. Add an assert and you'll catch the symptom, but not the missing edge that caused it.

Since the frame graph is a DAG, every dependency is explicitly encoded in the structure. That means you can build tools to visualize the entire pipeline: every pass, every resource edge, every implicit ordering decision. That’s impossible when barriers and ordering are scattered across hand-written render code.
The explorer below is a production-scale graph. Toggle each compiler feature on and off to see exactly what it contributes. Click any pass to inspect its dependencies: every edge was inferred from read() and write() calls, not hand-written.
What’s next
Async compute and split barriers are compiler features: they plug into the same DAG we built in Part II. But how do production engines actually ship all of this at scale? Part IV: Production Engines examines UE5’s RDG and Frostbite’s FrameGraph side by side, covering parallel command recording, legacy migration, and the engineering trade-offs that only matter at 700+ passes per frame.
