# Multimodal Inference in Dynamo
You can find example workflows and reference implementations for deploying a multimodal model with Dynamo in the multimodal examples.
## EPD vs. PD Disaggregation
Dynamo supports two primary approaches for processing multimodal inputs, which differ in how the initial media encoding step is handled relative to the main LLM inference engine.
### 1. EPD (Encode-Prefill-Decode) Disaggregation
The EPD approach introduces an explicit separation of the media encoding step, maximizing the utilization of specialized hardware and increasing overall system efficiency for large multimodal models.
- Media Input: An image, video, audio clip, or embedding URL is provided.
- Process Flow (illustrated in the sketch after this list):
  1. A dedicated Encode Worker is launched separately to handle embedding extraction from the media input.
  2. The extracted embeddings are transferred to the main engine via the NVIDIA Inference Xfer Library (NIXL).
  3. The main engine performs the remaining Prefill/Decode disaggregation steps to generate the output.
- Benefit: This disaggregation decouples media encoding hardware and resources from the main LLM serving engine, making the serving of large multimodal models more efficient.
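To make the division of labor concrete, here is a minimal, purely illustrative Python sketch of the EPD flow. It is not Dynamo's actual API: the encode worker, the NIXL buffer, and the main engine are hypothetical stand-ins, and toy tensors replace a real vision encoder and LLM so the flow runs end to end.

```python
"""Conceptual sketch of the EPD flow (not Dynamo's actual API)."""
import numpy as np


class FakeNixlBuffer:
    """Stand-in for a NIXL-registered buffer holding extracted embeddings."""

    def __init__(self, tensor: np.ndarray):
        self._tensor = tensor  # in a real EPD deployment this lives in GPU memory

    def read(self) -> np.ndarray:
        return self._tensor  # a real consumer would issue an RDMA read via NIXL


def encode_worker(media_url: str) -> FakeNixlBuffer:
    # Step 1: the dedicated Encode Worker extracts embeddings from the media
    # input. A random tensor stands in for the vision/audio encoder output.
    embeddings = np.random.rand(1, 576, 4096).astype(np.float32)
    return FakeNixlBuffer(embeddings)


def main_engine(prompt: str, handle: FakeNixlBuffer) -> str:
    # Steps 2-3: the main engine pulls the embeddings over NIXL, then runs the
    # remaining Prefill/Decode steps to generate the output text.
    embeddings = handle.read()
    return f"<text generated for {prompt!r} from embeddings of shape {embeddings.shape}>"


if __name__ == "__main__":
    handle = encode_worker("https://example.com/cat.jpg")
    print(main_engine("Describe this image.", handle))
```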
### 2. PD (Prefill-Decode) Disaggregation
The PD approach is a more traditional, aggregated method where the inference engine handles the entire process.
- Media Input: An image, video, or audio clip is loaded.
- Process Flow (see the request sketch after this list):
  1. The main engine receives the media input.
  2. The engine executes the full sequence: Encode + Prefill + Decode.
- Note: In this approach, the encoding step is executed within the same pipeline as the prefill and decode phases.
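From the client's point of view, EPD and PD deployments look the same: the split happens behind the frontend. Below is a minimal request sketch, assuming a Dynamo frontend that exposes an OpenAI-compatible `/v1/chat/completions` endpoint on `localhost:8000`; the port, model name, and image URL are placeholders, not values from this page.

```python
import requests

# Assumptions (not stated on this page): the Dynamo frontend is reachable at
# localhost:8000 and serves an OpenAI-compatible /v1/chat/completions route.
payload = {
    "model": "llava-hf/llava-1.5-7b-hf",  # placeholder model name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "http://images.example.com/cat.jpg"},
                },
            ],
        }
    ],
    "max_tokens": 128,
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```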
## Inference Framework Support Matrix
Dynamo supports multimodal capabilities across leading LLM inference backends, including vLLM, TensorRT-LLM (TRT-LLM), and SGLang. The table below details the current support level for EPD/PD and various media types for each stack.
| Stack | EPD Support | PD Support | Image | Video | Audio |
|---|---|---|---|---|---|
| vLLM | ✅ | ✅ | ✅ | ✅ | 🚧 |
| TRT-LLM | ✅ (Currently via precomputed Embeddings URL) | ✅ | ✅ | ❌ | ❌ |
| SGLang | ✅ | ❌ | ✅ | ❌ | ❌ |

✅ = supported · 🚧 = work in progress · ❌ = not supported
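The TRT-LLM EPD caveat above refers to supplying embeddings that were computed ahead of time and referenced by URL, rather than encoding the raw image inside the pipeline. A hypothetical offline step might look like the sketch below; the real vision encoder and the exact file and URL format expected by the Dynamo TRT-LLM example are not specified on this page, so a random tensor and `torch.save` stand in for them.

```python
import torch


def precompute_image_embeddings(out_path: str) -> None:
    # Stand-in for running the model's real vision encoder offline:
    # a [num_patches, hidden_size] tensor of the kind an encoder would emit.
    embeddings = torch.randn(576, 4096, dtype=torch.float16)
    torch.save(embeddings, out_path)  # host this file and reference it by URL


precompute_image_embeddings("image_embeddings.pt")
```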