Change8

v0.23.0

Breaking Changes
📦 vllmView on GitHub →
2 breaking25 features🐛 30 fixes3 deprecations🔧 23 symbols

Summary

v0.23.0 brings significant hardening and optimization for DeepSeek-V4, expands Model Runner V2 to Llama/Mistral models, and advances the experimental Rust frontend. This release also mandates compatibility with Transformers v5.

⚠️ Breaking Changes

  • Support for Transformers v4 is deprecated; the library now targets Transformers v5. Users must update their dependencies and ensure compatibility with v5 features.
  • The dedicated CUDA graph pool for Eagle has been removed (#44078). If you relied on specific pooling behavior for Eagle, you may need to adjust configurations.

Migration Steps

  1. Update dependencies to target Transformers v5, as v4 support is deprecated.
  2. Review usage of DeepSeek-V4 to ensure compatibility with decoupled sparse MLA metadata.
  3. If using speculative decoding, be aware of changes in lookahead-slot allocation and attention-group splitting.
  4. If using NixlConnector, plan migration away from the `kv_both` role.

✨ New Features

  • Model Runner V2 is now selected by default for Llama and Mistral dense models.
  • Model Runner V2 gained a FlashInfer sampler.
  • Model Runner V2 supports breakable CUDA graphs.
  • Model Runner V2 supports pipeline-parallel bubble elimination.
  • Model Runner V2 supports kernel block-size for hybrid models.
  • Experimental Rust frontend added a streaming `generate` endpoint.
  • Experimental Rust frontend added dynamic LoRA endpoints.
  • Experimental Rust frontend added `/version` and `/server_info` endpoints.
  • Experimental Rust frontend added a server-router extension hook.
  • Experimental Rust frontend added request-ID headers.
  • Experimental Rust frontend added new tool parsers (InternLM2, hy_v3, Phi-4-mini, Gemma4).
  • Added encoder-free Gemma 4 Unified support.
  • Multi-tier KV cache offloading gained an object-store secondary tier.
  • HMA is enabled by default for capable KV cache offloading connectors.
  • KV cache offloading gained tiering support for HMA models.
  • KV cache offloading gained a per-request offloading policy via the `on_new_request` lifecycle hook.
  • Reasoning and tool-call parsing are unified behind a single `Parser.parse()` interface.
  • The Responses parser was migrated to the unified `Parser.parse()` interface.
  • Added support for new models: MiMo-V2.5, Step3.7-Flash, Cosmos3 Reasoner, Gemma 4 Unified encoder-free, JetBrains Mellum v2, Granite Speech Plus, Cohere Mini Code.
  • Added Causal DFlash for speculative decoding.
  • Added support for pluggable `KVCacheSpec`.
  • Added support for sparse NCCL weight transfer for in-place updates.
  • Added async EPLB by default.
  • Added Triton MoE backend on Hopper by default.
  • Added support for FP8 FlashInfer attention for ViT.

🐛 Bug Fixes

  • DeepSeek-V4 sparse MLA metadata is now decoupled from DeepSeek-V3.2.
  • Fixed prefix-cache corruption issue in speculative decoding.
  • Fixed issue where KV-transfer tokens were excluded from `iteration_tokens_total`.
  • Fixed multiple-async-KV-load deadlock.
  • Fixed stale sliding-window block issue in KV cache.
  • Fixed issue where Qwen3-VL/Qwen3-omni-thinker deepstack accuracy failed under `torch.compile`.
  • Fixed EVS for Qwen3-VL.
  • Fixed GLM-5.1 PP loading.
  • Fixed GLM-4.1V processor logits.
  • Fixed GLM-4.6V video loader.
  • Fixed OlmoHybrid init.
  • Fixed Bailing-MoE rotary factor.
  • Fixed Step3 PP residual KeyError.
  • Fixed MiniCPM-V-4.6 video handling.
  • Fixed MiniCPM-O audio unpadding.
  • Fixed MiniCPM-V batched preprocessing.
  • Fixed FunASR-Nano init.
  • Fixed Cohere routing method.
  • Fixed Kimi-K2.5 FlashInfer ViT metadata.
  • Fixed issue where rejection-sampling acceptance-rate was incorrect.
  • Fixed prefix-cache corruption in speculative decoding.
  • Fixed issue where KV blocks were not zeroed for hybrid + FP8 KV cache.
  • Fixed actual batch `max_seq_len` calculation for attention metadata.
  • Fixed config validation rejecting 0/negative knobs in scheduler.
  • Fixed issue where Gemma 4 MTP failed under TP>1.
  • Fixed Gemma 4 block-table mismatch under concurrency.
  • Fixed Gemma 4 transformers-processor startup crash.
  • Fixed Gemma 4 CPU init.
  • Fixed issue where LoRA-adapter-name pooling was incorrect.
  • Fixed issue where skip decode-phase blocks occurred in CPU offload.

Affected Symbols

⚡ Deprecations

  • Support for Transformers v4 is deprecated (#40389).
  • Scheduled functions have deprecation notices (#43358).
  • The NixlConnector `kv_both` role is entering its deprecation cycle (#43874).