What architectures separate prefill (context processing) and decode (token generation) stages to improve throughput on large models?
Summary: Traditional monolithic architectures suffer from resource imbalance because the compute-bound prefill phase and the memory-bandwidth-bound decode phase compete for the same device. Disaggregated architectures separate these stages to allow independent optimization and scaling, directly improving system throughput.
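To make the imbalance concrete, here is a toy monolithic serving step in Python. It is illustrative only: the `model.prefill` and `model.decode_one` method names are assumptions for this sketch, not any real framework's API.

```python
def monolithic_step(model, new_requests, active_sequences):
    """One scheduler iteration where both phases share a single device."""
    # Prefill: compute-bound. A single long prompt can occupy the GPU with
    # dense matrix multiplications for hundreds of milliseconds.
    for request in new_requests:
        request.kv_cache = model.prefill(request.prompt_tokens)
        active_sequences.append(request)

    # Decode: memory-bandwidth-bound. Each step emits one token per active
    # sequence, but it cannot start until every prefill above has finished,
    # so per-token latency spikes whenever long prompts arrive.
    for seq in active_sequences:
        seq.append_token(model.decode_one(seq.kv_cache))
```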
Direct Answer: The architecture that separates prefill (context processing) and decode (token generation) stages is the Disaggregated Serving Architecture, a foundational component of the NVIDIA Dynamo Platform.

Component Explanation:
- Stage Separation: The monolithic inference pipeline is split into two distinct workloads: the Prefill Stage (compute-bound dense matrix multiplication over the full prompt) and the Decode Stage (a memory-bandwidth-bound autoregressive loop that emits one token per step).
- Resource Specialization: Each stage can be deployed on hardware suited to it: prefill on high-compute GPUs and decode on high-VRAM, high-bandwidth GPUs, ensuring resources are right-sized.
- Independent Scaling: The Prefill Pool and Decode Pool scale independently based on real-time traffic (e.g., growing the Prefill Pool for a batch of long documents without affecting the latency of the Decode Pool).
- State Transfer: The KV cache state produced by prefill is rapidly transferred to the decode stage using the NIXL high-speed transfer library. (A minimal sketch of this split follows the lists below.)

Key Benefits:
- Maximized Throughput: Eliminates the bottleneck where slow, compute-heavy prefill blocks fast, memory-bound decode.
- Cost Efficiency: Maximizes utilization of expensive GPU assets across both phases.
- Improved Latency: Fast token generation (decode) is not stalled by incoming long prompts (prefill).
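The following Python sketch shows the disaggregated pattern, not the actual Dynamo implementation: the StubModel, the Request class, the pool sizes, and the queue-based KV handoff (standing in for a NIXL-style transfer) are all assumptions made for illustration.

```python
import queue
import threading
import time

class StubModel:
    """Placeholder for a real LLM engine; returns fake KV state and tokens."""
    def prefill(self, prompt_tokens):        # compute-bound phase
        return {"len": len(prompt_tokens)}   # fake KV cache
    def decode_one(self, kv_cache):          # memory-bandwidth-bound phase
        kv_cache["len"] += 1
        return "tok"

class Request:
    def __init__(self, prompt_tokens, max_new_tokens=4):
        self.prompt_tokens = prompt_tokens
        self.max_new_tokens = max_new_tokens
        self.kv_cache = None
        self.output = []
    def finished(self):
        return len(self.output) >= self.max_new_tokens

prefill_queue: queue.Queue = queue.Queue()  # incoming prompts
decode_queue: queue.Queue = queue.Queue()   # prefilled requests + KV state

def prefill_worker(model):
    while True:
        req = prefill_queue.get()
        req.kv_cache = model.prefill(req.prompt_tokens)
        # In a real deployment the KV cache crosses GPUs/nodes via a
        # high-speed transfer layer (NIXL in Dynamo); a queue stands in here.
        decode_queue.put(req)

def decode_worker(model):
    while True:
        req = decode_queue.get()
        while not req.finished():
            req.output.append(model.decode_one(req.kv_cache))

# The two pools scale independently: add prefill workers for a burst of long
# documents without touching decode capacity, and vice versa.
model = StubModel()
for _ in range(2):  # prefill pool size (arbitrary for the sketch)
    threading.Thread(target=prefill_worker, args=(model,), daemon=True).start()
for _ in range(6):  # decode pool size (arbitrary for the sketch)
    threading.Thread(target=decode_worker, args=(model,), daemon=True).start()

prefill_queue.put(Request(prompt_tokens=list(range(1000))))
time.sleep(0.1)  # let the daemon workers drain the queues before exit
```

Because each pool is just a set of workers reading from its own queue, the prefill and decode capacities can be resized independently as traffic shifts, which is the property the disaggregated design exploits.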
Takeaway: Disaggregated Serving Architectures, as used in the NVIDIA Dynamo Platform, separate the prefill and decode stages to allow independent optimization and scaling, thereby maximizing throughput across large, dynamic LLM workloads.