v0.11.0
📦 vllm
⚠ 4 breaking · ✨ 8 features · 🐛 6 fixes · ⚡ 3 deprecations · 🔧 10 symbols
Summary
This release marks the complete transition to the V1 engine, removing all V0 components while introducing support for the DeepSeek-V3.2 and Qwen3 architectures. It also brings significant performance optimizations, including KV cache CPU offloading, DeepGEMM enabled by default, and Dual-Batch Overlap.
⚠️ Breaking Changes
- Complete removal of the V0 engine; V1 is now the only engine in the codebase. Code relying on V0-specific components such as AsyncLLMEngine or MQLLMEngine will fail (see the sketch after this list).
- CUDA graph mode default changed to FULL_AND_PIECEWISE. While generally better, it may impact models that only support PIECEWISE mode.
- C++17 is now globally enforced for builds.
- Removal of TokenizerGroup and various V0-specific model runner and executor classes.
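For code that previously constructed AsyncLLMEngine or MQLLMEngine directly, the high-level LLM entry point remains the supported path in V1. A minimal sketch of the unchanged offline API (the model name is illustrative):

```python
from vllm import LLM, SamplingParams

# The high-level LLM API is unchanged in V1; only direct uses of the
# removed V0 classes (AsyncLLMEngine, MQLLMEngine) need replacing.
llm = LLM(model="facebook/opt-125m")  # illustrative model
params = SamplingParams(temperature=0.8, max_tokens=64)

for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)
```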
Migration Steps
- Migrate code to the V1 engine interfaces; the V0 entry points (AsyncLLMEngine, LLMEngine, MQLLMEngine) have been removed.
- Ensure build environments support C++17.
- Update TPU code to use torch_xla.sync instead of xm.mark_step (first sketch after this list).
- Review CUDA graph settings if using models incompatible with FULL_AND_PIECEWISE mode (config sketch after this list).
- Note: Avoid using --async-scheduling in this version if preemption is required, as it may produce gibberish output.
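For TPU code, the rename is a one-line change. A hedged sketch, assuming a torch_xla release recent enough to expose the top-level sync API:

```python
import torch_xla

# Before (deprecated):
#   import torch_xla.core.xla_model as xm
#   xm.mark_step()

# After (assumes a torch_xla version that ships torch_xla.sync):
torch_xla.sync()  # flush pending lazy-tensor operations to the device
```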
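To pin the pre-0.11 CUDA graph behavior for a model that only supports piecewise capture, the mode can be set through the compilation config. A sketch assuming the V1 "cudagraph_mode" key:

```python
from vllm import LLM

# Force PIECEWISE capture instead of the new FULL_AND_PIECEWISE default;
# the "cudagraph_mode" key is assumed from the V1 compilation config.
llm = LLM(
    model="facebook/opt-125m",  # illustrative model
    compilation_config={"cudagraph_mode": "PIECEWISE"},
)
```

The same override should be expressible on the server side via --compilation-config with a JSON payload.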
✨ New Features
- Support for new architectures: DeepSeek-V3.2-Exp, Qwen3-VL series, Qwen3-Next, OLMo3, LongCat-Flash, Dots OCR, Ling2.0, and CWM.
- KV cache CPU offloading with LRU management.
- DeepGEMM enabled by default for improved throughput.
- Dual-Batch Overlap (DBO) for overlapping computation and communication.
- Support for FP4 (NVFP4) for dense models and Gemma3.
- Added BERT token classification/NER task support.
- Support for non-x86 CPU architectures, including RISC-V 64-bit and ARM.
- OpenAI API enhancements: prompt logprobs for all tokens and reasoning streaming events (example below).
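Prompt logprobs are exposed through the OpenAI-compatible server as a vLLM-specific extension passed via extra_body. A sketch using the openai client, assuming a vllm serve instance already running locally:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="facebook/opt-125m",          # whichever model the server hosts
    prompt="The capital of France is",
    max_tokens=8,
    extra_body={"prompt_logprobs": 1},  # vLLM extension: logprobs for prompt tokens
)
print(resp.choices[0].text)
```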
🐛 Bug Fixes
- Fixed MRoPE dispatch on CPU.
- Fixed Qwen3-Next Pipeline Parallelism (PP) issues.
- Fixed MoE Data Parallel accuracy on Intel XPU.
- Fixed implementation divergence for BLOOM models when using prompt embeds.
- Fixed BitsAndBytes (BNB) weight-name matching logic.
- Resolved misleading quantization warnings.
🔧 Affected Symbols
AsyncLLMEngine, LLMEngine, MQLLMEngine, xm.mark_step, MultiModalPlaceholderMap, LLM.apply_model, FlashInfer, DeepGEMM, AsyncOutputProcessor, Sampler
⚡ Deprecations
- Deprecated the gpu_-prefixed metrics in favor of new KV cache metrics.
- Deprecated xm.mark_step in favor of torch_xla.sync for TPU.
- Removed various V0 components including AsyncLLMEngine, LLMEngine, MQLLMEngine, and legacy attention backends.