
v0.22.0

📦 vllm
95 features · 🐛 23 fixes · 🔧 46 symbols

Summary

This release focuses heavily on DeepSeek V4 maturity with new kernel support and packaging, significant advancements in Model Runner V2, and the introduction of an experimental Rust frontend. Performance saw notable gains from batch-invariant inference with Cutlass FP8 and the rollout of multi-tier KV cache offloading.

Migration Steps

  1. If using DeepSeek V4, note that the model structure has been reorganized into the vllm/models/deepseek_v4/ package.
  2. If using Model Runner V2 with KV connectors, be aware that it now automatically falls back to MRv1 when a KV connector is present.
  3. If using Qwen3 dense models, Model Runner V2 may be selected by default.
  4. If using KV offloading, note the new multi-tier framework and that the layout preference may change (HND layout is preferred).
  5. If using MoE, note that experts have been moved to the experts/ directory.
  6. If using CPU/RISC-V, consider using the --cpu-distributed-timeout-seconds flag for distributed setups.
  7. If using EPLB, note the change in default EPLB communicator.
  8. If using LoRA with MoE, be aware of updates to 2D-weight memory usage under EP.
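Steps 2 and 3 describe two rules that affect which model runner ends up in use. The sketch below shows one way they might combine, assuming the KV-connector fallback takes precedence over the Qwen3 dense default. The function and parameter names (choose_model_runner, is_qwen3_dense, has_kv_connector, requested_v2) are hypothetical and are not part of vLLM's API; the real selection logic may consider more conditions.

```python
def choose_model_runner(is_qwen3_dense: bool,
                        has_kv_connector: bool,
                        requested_v2: bool = False) -> str:
    """Illustrative only: return "MRv2" or "MRv1" per the rules in these notes.

    Assumption: the KV-connector fallback overrides both an explicit
    request for MRv2 and the Qwen3 dense default.
    """
    # MRv2 is a candidate if the user asked for it, or if the model is a
    # Qwen3 dense model (selected by default per step 3).
    wants_v2 = requested_v2 or is_qwen3_dense

    # Step 2: with a KV connector present, MRv2 falls back to MRv1.
    if wants_v2 and has_kv_connector:
        return "MRv1"

    return "MRv2" if wants_v2 else "MRv1"


if __name__ == "__main__":
    for dense, connector in [(True, False), (True, True), (False, False)]:
        runner = choose_model_runner(dense, connector)
        print(f"qwen3_dense={dense} kv_connector={connector} -> {runner}")
```

If a deployment depends on a specific runner, it is worth confirming in the startup logs which one was actually selected rather than relying on a sketch like this.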

✨ New Features

  • DeepSeek V4 model reorganized into a dedicated vllm/models/deepseek_v4/ package.
  • DeepSeek V4 gained NVFP4 fused MoE support.
  • DeepSeek V4 gained full + piecewise CUDA graph support.
  • DeepSeek V4 gained MTP speculative decoding support.
  • Model Runner V2 added an oracle that selects MRv2 for Qwen3 dense models by default.
  • Model Runner V2 added sleep-mode weight reload capability.
  • Model Runner V2 added support for update_config.
  • Model Runner V2 added shared KV-cache layers.
  • Experimental Rust front-end integration landed.
  • A DP Supervisor for data-parallel serving was implemented.
  • Batch-invariant inference gained Cutlass FP8 support, resulting in a 28.9% end-to-end latency improvement.
  • Batch-invariant inference gained compile-mode support on SM80.
  • Batch-invariant inference gained an NVFP4 Cutlass linear path.
  • A new multi-tier KV cache offloading framework was introduced.
  • Multi-tier KV cache offloading now supports a Python filesystem secondary tier.
  • Multi-tier KV cache offloading added support for DeepSeek V4.
  • Multi-tier KV cache offloading added Mooncake disk offloading.
  • New architecture support: MiniCPM-V 4.6.
  • New architecture support: InternS2 Preview.
  • New architecture support: OpenVLA.
  • Speculative decoding gained a custom callable proposer backend.
  • Speculative decoding added post-norm EAGLE-3 speculators.
  • Speculative decoding added peagle speculators.
  • Speculative decoding added support for hybrid-attention models in extract_hidden_states.
  • Speculative decoding added non-MTP speculation for NemotronH.
  • Speculative decoding added shared MTP weights in MRv2.
  • Qwen3.5/3.6 gained GDN output-projection flattening.
  • Qwen3.5/3.6 gained a GatedDeltaNet Marlin fix for TP>=2.
  • Qwen3.5/3.6 gained ViT full CUDA graph support.
  • Gemma3/Gemma4 gained mixed-resolution image co-batching support.
  • Gemma3/Gemma4 gained an MoE routing closure fix.
  • Gemma3/Gemma4 gained batched vision encoder for image/video.
  • Kimi-K2.5 gained the ability to skip vision-tower dtype conversion under quantization.
  • Cohere MoE is now enabled.
  • Cohere gained pipeline parallelism for vision.
  • Tool calling gained Apertus tool parser.
  • Tool calling gained shared coerce_to_schema_type across MiniMax-M2 / DeepSeek-V3.2 / Seed-OSS parsers.
  • ViT CUDA graph support extended to Qwen2-VL, Step3-VL encoder, and Qwen3.5.
  • Model Runner V2 added FP32 gumbel sampling.
  • Model Runner V2 added correctness fixes for logprob_token_ids.
  • MoE refactor introduced ExpertMapManager.
  • MoE refactor moved experts to experts/ directory.
  • MoE refactor introduced RoutedExperts alias for FusedMoE.
  • Mamba attention module refactored.
  • Mamba gained Mamba2 SSD kernel warmup.
  • Mamba gained bf16 SSM cache support.
  • Mamba gained GPU-side state postprocessing fused kernel.
  • Mamba now runs single-token extends as decodes.
  • KV events now emit KV cache metadata.
  • Allocator introduced manual cumem allocator enable.
  • Allocator introduced stream-aware free callback.
  • EPLB now stages/commits MoE quant method on reconfigure.
  • FlashInfer gained b12x MoE + FP4 GEMM for SM120/121.
  • FlashInfer gained per-tensor FP8 CUTLASS on SM12.1.
  • FlashInfer gained head_dim=512 support for TRTLLM attention.
  • FlashInfer gained GDN prefill kernel for SM100.
  • Performance improvements via CutlassFP8 padding pre-processing (+13.5% TTFT).
  • Performance improvements via padded NVFP4 quant kernel (+2.4–5.7% E2E).
  • Performance improvements via GPU<->CPU sync elimination (1/n and 4/n).
  • Performance improvements via fused RoPE+KVCache+q_concat for MLA.
  • Performance improvements via MLA compute_prefill_context / _v_up_proj optimizations.
  • Performance improvements via penalties Triton kernel.
  • Performance improvements via FULL CUDA graph capture for TRITON_MLA decode.
  • ROCm gained flash sparse MLA Triton kernels.
  • ROCm gained gluon paged MQA logits on gfx950/MI355X.
  • ROCm gained RMSNorm+Quant fusion for gfx950.
  • ROCm gained XGMI backend for MoRI connector.
  • ROCm gained QuickReduce min-size override.
  • CPU/RISC-V gained RVV-optimized attention kernels for RISC-V Vector Extension.
  • CPU/RISC-V gained fused GDN for AMX CPU.
  • CPU/RISC-V gained MXFP4 W4A16 MoE support.
  • CPU/RISC-V gained experimental Triton + MRv2 on CPU.
  • CPU/RISC-V improved CPU thread utilization.
  • Intel XPU gained GPTQ int4 support.
  • Intel XPU gained mxfp8 MoE support.
  • Intel XPU gained FP8 block-scaled quantization.
  • Intel XPU gained multiple sparse-attention kernels.
  • Intel XPU gained MoE topk routing + MXFP4 fallback.
  • Intel XPU gained CT W4A4 MXFP4 path.
  • Disaggregated serving (NIXL) gained lease-renewal TTL for KV blocks on P.
  • Disaggregated serving (NIXL) gained GDN support for PD with NIXL.
  • Mooncake gained disk offloading in MooncakeStoreConnector.
  • Mooncake gained HMA support for DSV4.
  • Mooncake gained operation metrics.
  • Data parallel gained DP Supervisor.
  • Data parallel gained the ability to publish request counts at engine-step start.
  • LoRA gained one-shot Triton kernel for MoE LoRA.
  • LoRA gained simultaneous 2D & 3D MoE LoRA adapters.
  • MXFP4 gained linear layers + compressed-tensors integration.
  • NVFP4 gained DeepSeek V4 fused MoE support.
  • NVFP4 gained ModelOpt W4A16 NVFP4 fused MoE + mixed-precision dispatch.
  • NVFP4 gained batch-invariant NVFP4 Cutlass linear path.
  • Quark gained the ability to load Quark NVFP4 checkpoints.
  • Quark gained a fix for W8A8 INT8 garbage output on Step-3.5-Flash.
  • AutoRound gained W4A16 support.

🐛 Bug Fixes

  • DeepSeek V4 accuracy fixes landed.
  • DeepSeek V4 ROCm parity fixes landed.
  • Model Runner V2 had many correctness fixes.
  • Model Runner V2 automatically falls back to MRv1 when a KV connector is present.
  • Fix for runai-streamer weight loading for Qwen3.5/MTP/Qwen3-VL.
  • Fix for Qwen3.5/3.6 KDA chunk-prefill exp2 semantics.
  • Fix for Gemma3/Gemma4 mixed-resolution image co-batching crash.
  • Fix for Gemma3/Gemma4 MoE routing closure.
  • Fix for Gemma3/Gemma4 tool-parser float-corruption.
  • Fix for Gemma3/Gemma4 multi-GPU issues.
  • Fix for Kimi-K2.5 mm_projector dtype.
  • Re-landed fix for Qwen3Coder anyOf/oneOf/$ref resolution.
  • Fix for Model Runner V2 prompt-logprobs size.
  • Fix for KV offloading store-deferral.
  • Fix for VLM-wrapper init in EPLB.
  • Fix for MoE LoRA align-kernel grid.
  • Fix for FlashInfer TRTLLM NvFP4 monolithic MoE routing.
  • Fix for TRTLLM NVFP4 MoE chunking.
  • Fix for Quark W8A8 INT8 garbage output on Step-3.5-Flash.
  • Fix for multi-node TP>8 in NIXL.
  • Fix for NIXL side-channel host-selection.
  • Fix for Mooncake load-failure propagation.
  • Fix for Mooncake finish-after-preemption handling.

Affected Symbols