v0.11.0
📦 vllm
⚠ 4 breaking · ✨ 8 features · 🐛 6 fixes · ⚡ 3 deprecations · 🔧 10 symbols
Summary
This release marks the complete transition to the V1 engine, removing all V0 components while introducing support for the DeepSeek-V3.2 and Qwen3 architectures. It also brings significant performance optimizations, including KV cache CPU offloading, DeepGEMM enabled by default, and Dual-Batch Overlap.
⚠️ Breaking Changes
- Complete removal of the V0 engine; V1 is now the only engine in the codebase. Code relying on V0-specific components such as AsyncLLMEngine or MQLLMEngine will fail (see the sketch after this list).
- CUDA graph mode default changed to FULL_AND_PIECEWISE. While generally better, it may impact models that only support PIECEWISE mode.
- C++17 is now globally enforced for builds.
- Removal of TokenizerGroup and various V0-specific model runner and executor classes.
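For code that previously constructed AsyncLLMEngine or MQLLMEngine directly, the high-level LLM entry point remains the supported path in V1. A minimal sketch of the unchanged offline API (the model name is illustrative):

```python
from vllm import LLM, SamplingParams

# The high-level LLM API is unchanged in V1; only direct uses of the
# removed V0 classes (AsyncLLMEngine, MQLLMEngine) need replacing.
llm = LLM(model="facebook/opt-125m")  # illustrative model
params = SamplingParams(temperature=0.8, max_tokens=64)

for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)
```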
Migration Steps
- Migrate code to the V1 engine interfaces; the V0 entry points (AsyncLLMEngine, LLMEngine, MQLLMEngine) have been removed.
- Ensure build environments support C++17.
- Update TPU code to use torch_xla.sync instead of xm.mark_step (first sketch after this list).
- Review CUDA graph settings if using models incompatible with FULL_AND_PIECEWISE mode (config sketch after this list).
- Note: Avoid using --async-scheduling in this version if preemption is required, as it may produce gibberish output.
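For TPU code, the rename is a one-line change. A hedged sketch, assuming a torch_xla release recent enough to expose the top-level sync API:

```python
import torch_xla

# Before (deprecated):
#   import torch_xla.core.xla_model as xm
#   xm.mark_step()

# After (assumes a torch_xla version that ships torch_xla.sync):
torch_xla.sync()  # flush pending lazy-tensor operations to the device
```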
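To pin the pre-0.11 CUDA graph behavior for a model that only supports piecewise capture, the mode can be set through the compilation config. A sketch assuming the V1 "cudagraph_mode" key:

```python
from vllm import LLM

# Force PIECEWISE capture instead of the new FULL_AND_PIECEWISE default;
# the "cudagraph_mode" key is assumed from the V1 compilation config.
llm = LLM(
    model="facebook/opt-125m",  # illustrative model
    compilation_config={"cudagraph_mode": "PIECEWISE"},
)
```

The same override should be expressible on the server side via --compilation-config with a JSON payload.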
✨ New Features
- Support for new architectures: DeepSeek-V3.2-Exp, Qwen3-VL series, Qwen3-Next, OLMo3, LongCat-Flash, Dots OCR, Ling2.0, and CWM.
- KV cache CPU offloading with LRU management.
- DeepGEMM enabled by default for improved throughput.
- Dual-Batch Overlap (DBO) for overlapping computation and communication.
- Support for FP4 (NVFP4) for dense models and Gemma3.
- Added BERT token classification/NER task support.
- Support for non-x86 CPU architectures, including RISC-V 64-bit and ARM.
- OpenAI API enhancements: prompt logprobs for all tokens and reasoning streaming events (example below).
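Prompt logprobs are exposed through the OpenAI-compatible server as a vLLM-specific extension passed via extra_body. A sketch using the openai client, assuming a vllm serve instance already running locally:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="facebook/opt-125m",          # whichever model the server hosts
    prompt="The capital of France is",
    max_tokens=8,
    extra_body={"prompt_logprobs": 1},  # vLLM extension: logprobs for prompt tokens
)
print(resp.choices[0].text)
```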
🐛 Bug Fixes
- Fixed MRoPE dispatch on CPU.
- Fixed Qwen3-Next Pipeline Parallelism (PP) issues.
- Fixed MoE Data Parallel accuracy on Intel XPU.
- Fixed implementation divergence for BLOOM models when using prompt embeds.
- Fixed BitsAndBytes (BNB) weight-name matching logic.
- Resolved misleading quantization warnings.
🔧 Affected Symbols
AsyncLLMEngine, LLMEngine, MQLLMEngine, xm.mark_step, MultiModalPlaceholderMap, LLM.apply_model, FlashInfer, DeepGEMM, AsyncOutputProcessor, Sampler
⚡ Deprecations
- Deprecated the gpu_-prefixed metrics in favor of new KV cache metrics.
- Deprecated xm.mark_step in favor of torch_xla.sync for TPU.
- Removed various V0 components including AsyncLLMEngine, LLMEngine, MQLLMEngine, and legacy attention backends.