v0.23.0
📦 vllm
⚠ 2 breaking · ✨ 25 features · 🐛 30 fixes · ⚡ 3 deprecations · 🔧 23 symbols
Summary
v0.23.0 hardens and optimizes DeepSeek-V4, makes Model Runner V2 the default for Llama and Mistral dense models, and advances the experimental Rust frontend. The release also targets Transformers v5 and deprecates Transformers v4 support.
⚠️ Breaking Changes
- Support for Transformers v4 is deprecated (#40389); the library now targets Transformers v5. Update your dependencies to v5 and verify that your code works against it.
- The dedicated CUDA graph pool for Eagle has been removed (#44078). If your setup relied on Eagle-specific pooling behavior, review and adjust your configuration.
Migration Steps
- Update dependencies to target Transformers v5, as v4 support is deprecated.
- Review usage of DeepSeek-V4 to ensure compatibility with decoupled sparse MLA metadata.
- If using speculative decoding, be aware of changes in lookahead-slot allocation and attention-group splitting.
- If using NixlConnector, plan migration away from the `kv_both` role.
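The first migration step can be checked before upgrading. The snippet below is a minimal sketch using only the Python standard library; the helper names `major_version` and `check_transformers` are hypothetical and not part of vLLM or Transformers, and the threshold of 5 simply mirrors the "targets Transformers v5" statement above.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '5.0.0rc1'."""
    match = re.match(r"\s*v?(\d+)", version_string)
    if match is None:
        raise ValueError(f"cannot parse version: {version_string!r}")
    return int(match.group(1))


def check_transformers(minimum_major: int = 5) -> str:
    """Describe whether the installed transformers package meets the target."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if major_version(installed) < minimum_major:
        return f"transformers {installed} is older than v{minimum_major}; upgrade recommended"
    return f"transformers {installed} meets the v{minimum_major} target"


if __name__ == "__main__":
    print(check_transformers())
```

Running this in the environment that hosts the server gives a quick indication of whether the dependency update is still pending.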
✨ New Features
- Model Runner V2 is now selected by default for Llama and Mistral dense models.
- Model Runner V2 gained a FlashInfer sampler.
- Model Runner V2 supports breakable CUDA graphs.
- Model Runner V2 supports pipeline-parallel bubble elimination.
- Model Runner V2 supports kernel block-size for hybrid models.
- Experimental Rust frontend added a streaming `generate` endpoint.
- Experimental Rust frontend added dynamic LoRA endpoints.
- Experimental Rust frontend added `/version` and `/server_info` endpoints.
- Experimental Rust frontend added a server-router extension hook.
- Experimental Rust frontend added request-ID headers.
- Experimental Rust frontend added new tool parsers (InternLM2, hy_v3, Phi-4-mini, Gemma4).
- Added encoder-free Gemma 4 Unified support.
- Multi-tier KV cache offloading gained an object-store secondary tier.
- HMA is enabled by default for capable KV cache offloading connectors.
- KV cache offloading gained tiering support for HMA models.
- KV cache offloading gained a per-request offloading policy via the `on_new_request` lifecycle hook.
- Reasoning and tool-call parsing are unified behind a single `Parser.parse()` interface.
- The Responses parser was migrated to the unified `Parser.parse()` interface.
- Added support for new models: MiMo-V2.5, Step3.7-Flash, Cosmos3 Reasoner, Gemma 4 Unified encoder-free, JetBrains Mellum v2, Granite Speech Plus, Cohere Mini Code.
- Added Causal DFlash for speculative decoding.
- Added support for pluggable `KVCacheSpec`.
- Added support for sparse NCCL weight transfer for in-place updates.
- Async EPLB is now enabled by default.
- The Triton MoE backend is now the default on Hopper.
- Added support for FP8 FlashInfer attention for ViT.
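The `/version` and `/server_info` endpoints listed above for the experimental Rust frontend can be probed with a small client. This is a sketch under assumptions: the base URL `http://localhost:8000` is a placeholder, the helper names are hypothetical, and the code assumes both endpoints answer a plain GET with a JSON body, which these notes do not confirm.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def endpoint_url(base_url: str, path: str) -> str:
    """Join a server base URL and an endpoint path without doubling slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def fetch_json(base_url: str, path: str, timeout: float = 5.0):
    """GET an endpoint and decode the body as JSON (assumed response format)."""
    with urlopen(endpoint_url(base_url, path), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def probe(base_url: str) -> dict:
    """Query both informational endpoints and return their decoded payloads."""
    return {path: fetch_json(base_url, path) for path in ("/version", "/server_info")}
```

Calling `probe("http://localhost:8000")` against a running server would return one entry per endpoint; adjust the base URL to match your deployment.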
🐛 Bug Fixes
- DeepSeek-V4 sparse MLA metadata is now decoupled from DeepSeek-V3.2.
- Fixed prefix-cache corruption issue in speculative decoding.
- Fixed issue where KV-transfer tokens were excluded from `iteration_tokens_total`.
- Fixed multiple-async-KV-load deadlock.
- Fixed stale sliding-window block issue in KV cache.
- Fixed issue where Qwen3-VL/Qwen3-omni-thinker deepstack accuracy failed under `torch.compile`.
- Fixed EVS for Qwen3-VL.
- Fixed GLM-5.1 PP loading.
- Fixed GLM-4.1V processor logits.
- Fixed GLM-4.6V video loader.
- Fixed OlmoHybrid init.
- Fixed Bailing-MoE rotary factor.
- Fixed Step3 PP residual KeyError.
- Fixed MiniCPM-V-4.6 video handling.
- Fixed MiniCPM-O audio unpadding.
- Fixed MiniCPM-V batched preprocessing.
- Fixed FunASR-Nano init.
- Fixed Cohere routing method.
- Fixed Kimi-K2.5 FlashInfer ViT metadata.
- Fixed an incorrect rejection-sampling acceptance rate.
- Fixed prefix-cache corruption in speculative decoding.
- Fixed issue where KV blocks were not zeroed for hybrid + FP8 KV cache.
- Fixed the calculation of the actual batch `max_seq_len` in attention metadata.
- Scheduler config validation now rejects zero or negative values for its knobs.
- Fixed issue where Gemma 4 MTP failed under TP>1.
- Fixed Gemma 4 block-table mismatch under concurrency.
- Fixed Gemma 4 transformers-processor startup crash.
- Fixed Gemma 4 CPU init.
- Fixed issue where LoRA-adapter-name pooling was incorrect.
- Fixed CPU offload so that decode-phase blocks are skipped.
⚡ Deprecations
- Support for Transformers v4 is deprecated (#40389).
- Functions scheduled for removal now carry deprecation notices (#43358).
- The NixlConnector `kv_both` role is entering its deprecation cycle (#43874).
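To find deployments affected by the `kv_both` deprecation, existing KV-transfer configurations can be scanned. The sketch below assumes the configuration is available as a JSON string with `kv_connector` and `kv_role` keys, as typically passed to `--kv-transfer-config`; treat those key names as assumptions and check them against your own setup. The function name is hypothetical.

```python
import json

DEPRECATED_ROLE = "kv_both"


def uses_deprecated_kv_role(kv_transfer_config: str) -> bool:
    """Return True if a JSON config selects NixlConnector with the kv_both role."""
    config = json.loads(kv_transfer_config)
    return (
        config.get("kv_connector") == "NixlConnector"
        and config.get("kv_role") == DEPRECATED_ROLE
    )


if __name__ == "__main__":
    example = '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
    if uses_deprecated_kv_role(example):
        print("kv_both is entering deprecation; plan a migration to another role")
```

The check only flags the configuration; which role to move to depends on how the instance participates in KV transfer.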