v0.12.0
📦 vllm
⚠ 6 breaking · ✨ 8 features · 🐛 5 fixes · ⚡ 7 deprecations · 🔧 8 symbols
Summary
vLLM v0.12.0 introduces a major architectural shift with GPU Model Runner V2 and PyTorch 2.9.0 integration. It delivers significant performance gains (up to 18% higher throughput) and expands support for DeepSeek-V3, multimodal models, and AMD hardware.
⚠️ Breaking Changes
- The PyTorch upgrade to 2.9.0 requires a CUDA 12.9 environment.
- Removed the 'num_lookahead_slots' parameter.
- Removed the 'best_of' parameter (migration sketch after this list).
- Removed LoRA extra-vocab support.
- Mistral format is now auto-detected during model loading.
- Online quantization logic moved to 'model.load_weights'.
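With 'best_of' gone, candidate selection moves client-side. A minimal sketch, assuming an illustrative small model and that cumulative log probabilities are populated when logprobs are requested:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative model

# Previously: SamplingParams(n=1, best_of=4). 'best_of' is removed in v0.12.0,
# so generate n candidates and select among them yourself.
params = SamplingParams(n=4, temperature=0.8, max_tokens=64, logprobs=1)
outputs = llm.generate(["Summarize vLLM in one sentence."], params)

# Pick the candidate with the highest cumulative log probability client-side.
best = max(outputs[0].outputs, key=lambda o: o.cumulative_logprob)
print(best.text)
```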
Migration Steps
- Update the host environment to CUDA 12.9.
- Remove the 'best_of' and 'num_lookahead_slots' parameters from API calls; both have been removed.
- Update LoRA configurations to remove extra-vocab dependencies.
- Transition away from the 'xformers' backend to supported alternatives such as FlashInfer or Triton.
- Update GGUF loading code to use the new 'repo_id:quant_type' syntax (sketch below).
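A minimal sketch of the last two steps; the GGUF repo and quant type are illustrative placeholders, and FLASHINFER is one of the documented values for VLLM_ATTENTION_BACKEND:

```python
import os
from vllm import LLM

# Move off the deprecated xformers backend to a supported alternative.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# New GGUF syntax: '<repo_id>:<quant_type>' rather than a local .gguf path.
# Repo and quant type below are illustrative placeholders.
llm = LLM(model="unsloth/gemma-3-4b-it-GGUF:Q4_K_M")
print(llm.generate(["Hello"])[0].outputs[0].text)
```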
✨ New Features
- GPU Model Runner V2 (Experimental): Refactored execution pipeline with persistent block tables and Triton-native sampler.
- EAGLE Speculative Decoding: multi-step CUDA graph, DP>1, and multimodal support (sketch after this list).
- Prefill Context Parallel (PCP): Partitions sequence dimension during prefill for long-sequence inference.
- AMD ROCm Expansion: Support for DeepSeek v3.2, SparseMLA, and FP8 MLA decode.
- New Model Support: PLaMo-3, OpenCUA-7B, HunyuanOCR, Mistral Large 3, and Gemma3 GGUF.
- RLHF Support: Pause and resume generation for asynchronous RL training.
- Audio Support: Audio embeddings in chat completions and Qwen3 Omni audio-in-video.
- Optimization Levels: new -O0 through -O3 flags trade startup time against steady-state performance (examples after this list).
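Enabling EAGLE still goes through the existing speculative_config engine argument; a minimal sketch, where the draft-head checkpoint is an illustrative example and must match your target model:

```python
from vllm import LLM, SamplingParams

# The draft head below is illustrative; use the EAGLE head trained for
# your target model.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 4,
    },
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```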
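For the optimization levels, the -O flag is the CLI entry point; passing an integer level through compilation_config in the Python API is an assumption based on the existing CompilationConfig interface:

```python
from vllm import LLM

# CLI: `vllm serve <model> -O0` for fastest startup,
#      `vllm serve <model> -O3` for full compilation and best steady-state perf.
# Python equivalent (assumption: integer levels map to CompilationConfig levels):
llm = LLM(model="facebook/opt-125m", compilation_config=3)
```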
🐛 Bug Fixes
- Fixed the QwenVL cos/sin cache optimization.
- Removed the -1 temperature hack in the Triton-native sampler.
- Relaxed the DeepGEMM N-dimension multiple restriction from 128 to 64.
- Improved the H200 Fused MoE configuration.
- Reduced the Docker image size by ~200 MB.
🔧 Affected Symbols
- GPUModelRunnerV2
- ParallelConfig
- CompilationConfig.use_inductor
- SamplingParams
- model.load_weights
- AiterFlashAttentionBackend
- FusedMoE
- ToolServer
⚡ Deprecations
- The xformers backend is deprecated.
- Setting 'seed=None' is deprecated.
- EPLB fields defined directly on ParallelConfig (scheduled for removal).
- guided_* config fields (scheduled for removal; sketch after this list).
- override_pooler_config and disable_log_requests (scheduled for removal).
- CompilationConfig.use_inductor (scheduled for removal).
- Old metrics (scheduled for removal).
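A sketch of moving off two of the deprecated surfaces, assuming the structured-outputs API (StructuredOutputsParams) is the replacement for the guided_* fields, as in recent vLLM releases:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

# Pass an explicit integer seed instead of the deprecated seed=None.
llm = LLM(model="facebook/opt-125m", seed=0)

# guided_json and friends are scheduled for removal; structured_outputs is
# the replacement (naming assumed from current vLLM releases).
schema = {"type": "object", "properties": {"answer": {"type": "string"}}}
params = SamplingParams(
    max_tokens=64,
    structured_outputs=StructuredOutputsParams(json=schema),
)
print(llm.generate(["Reply in JSON."], params)[0].outputs[0].text)
```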