Changelog

v0.11.0

📦 vllm
⚠️ 4 breaking · ✨ 8 features · 🐛 6 fixes · ⚡ 3 deprecations · 🔧 10 symbols

Summary

This release marks the complete transition to the V1 engine, removing all V0 components while introducing support for DeepSeek-V3.2 and Qwen3 architectures. It features significant performance optimizations including KV cache CPU offloading, DeepGEMM by default, and Dual-Batch Overlap.

⚠️ Breaking Changes

  • Complete removal of the V0 engine. V1 is now the only engine in the codebase; code relying on V0-specific components such as AsyncLLMEngine or MQLLMEngine will fail.
  • CUDA graph mode default changed to FULL_AND_PIECEWISE. While generally better, it may impact models that only support PIECEWISE mode.
  • C++17 is now globally enforced for builds.
  • Removal of the TokenizerGroup class and various V0-specific model runner and executor classes.

Migration Steps

  1. Migrate the codebase to the V1 engine interfaces; the V0 AsyncLLMEngine and LLMEngine have been removed.
  2. Ensure build environments support C++17.
  3. Update TPU code to use torch_xla.sync instead of xm.mark_step.
  4. Review CUDA graph settings if using models incompatible with FULL_AND_PIECEWISE mode.
  5. Note: Avoid using --async-scheduling in this version if preemption is required, as it may produce gibberish output.
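For step 4, the CUDA graph mode can be overridden at serve time via the compilation config. A hedged sketch (the model name is a placeholder, and the exact JSON keys should be checked against your installed version's `--compilation-config` schema):

```shell
# Sketch: force PIECEWISE CUDA graphs for a model that does not
# support the new FULL_AND_PIECEWISE default. Verify the key name
# against your vLLM version before relying on this.
vllm serve <your-model> \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}'
```

If the model misbehaves under the new default, falling back to the previous piecewise-only capture is the least invasive first step.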

✨ New Features

  • Support for new architectures: DeepSeek-V3.2-Exp, Qwen3-VL series, Qwen3-Next, OLMo3, LongCat-Flash, Dots OCR, Ling2.0, and CWM.
  • KV cache CPU offloading with LRU management.
  • DeepGEMM enabled by default for improved throughput.
  • Dual-Batch Overlap (DBO) for overlapping computation and communication.
  • Support for FP4 (NVFP4) for dense models and Gemma3.
  • Added BERT token classification/NER task support.
  • Support for RISC-V 64-bit and ARM non-x86 CPU architectures.
  • OpenAI API enhancements: prompt logprobs for all tokens and reasoning streaming events.
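To illustrate the LRU management behind KV cache CPU offloading, here is a minimal toy sketch in pure Python. This is not vLLM's implementation; the class and method names (`CPUOffloadCache`, `put`, `get`) are hypothetical, and real offloading moves tensor blocks between GPU and pinned host memory rather than storing Python objects:

```python
from collections import OrderedDict


class CPUOffloadCache:
    """Toy LRU-managed CPU pool for KV blocks evicted from GPU.

    Illustrative sketch only -- names and structure are assumptions,
    not vLLM's actual offloading code.
    """

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self._blocks = OrderedDict()  # block_id -> offloaded block data

    def put(self, block_id, block):
        # Offload a block evicted from GPU; when the pool is full,
        # the least recently used block is dropped entirely.
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)
        self._blocks[block_id] = block
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict LRU entry

    def get(self, block_id):
        # A hit promotes the block to most-recently-used and returns
        # it for copy-back to GPU, avoiding KV recomputation.
        if block_id not in self._blocks:
            return None
        self._blocks.move_to_end(block_id)
        return self._blocks[block_id]
```

The LRU policy means blocks for recently active requests survive longest in host memory, which matches the access pattern of resumed or repeated prompts.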

🐛 Bug Fixes

  • Fixed MRoPE dispatch on CPU.
  • Fixed Qwen3-Next Pipeline Parallelism (PP) issues.
  • Fixed MoE Data Parallel accuracy on Intel XPU.
  • Fixed implementation divergence for BLOOM models when using prompt embeds.
  • Fixed BNB name matching logic.
  • Resolved misleading quantization warnings.

🔧 Affected Symbols

AsyncLLMEngine, LLMEngine, MQLLMEngine, xm.mark_step, MultiModalPlaceholderMap, LLM.apply_model, FlashInfer, DeepGEMM, AsyncOutputProcessor, Sampler

⚡ Deprecations

  • Deprecated gpu_-prefixed metrics in favor of new KV cache metrics.
  • Deprecated xm.mark_step in favor of torch_xla.sync for TPU.
  • Removed various V0 components including AsyncLLMEngine, LLMEngine, MQLLMEngine, and legacy attention backends.