v0.24.0

📅 Jun 29, 2026 · 📦 vllm

⚠ 2 breaking · ✨ 36 features · 🐛 19 fixes · ⚡ 2 deprecations · 🔧 41 symbols

Summary

v0.24.0 introduces extensive support and performance optimizations for new models like MiniMax-M3 and DeepSeek-V4, matures the Model Runner V2 with default quantization support, and overhauls device selection by removing internal use of CUDA_VISIBLE_DEVICES.

⚠️ Breaking Changes

vLLM no longer sets the `CUDA_VISIBLE_DEVICES` environment variable internally. Users must now specify target devices explicitly with the new `device_ids` argument when initializing the engine or API server.
On ROCm platforms, use of `CUDA_VISIBLE_DEVICES` is now deprecated and will be removed in a future release. Users should switch to the `device_ids` argument.

Migration Steps

Wherever you relied on vLLM setting `CUDA_VISIBLE_DEVICES` internally, pass the explicit `device_ids` argument when initializing vLLM components instead (e.g., `LLM(..., device_ids=[0, 1])`).
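
For deployments that already carry a `CUDA_VISIBLE_DEVICES`-style string in their launch scripts, a small conversion helper may ease the transition. The sketch below is illustrative only: `device_ids_from_env` is a hypothetical helper name, not part of vLLM, and it assumes `device_ids` accepts a list of integer indices, as the `LLM(..., device_ids=[0, 1])` example above suggests. GPU UUID entries are deliberately rejected to keep the sketch short.

```python
def device_ids_from_env(value):
    """Parse a CUDA_VISIBLE_DEVICES-style string such as "0,1" into [0, 1].

    Hypothetical helper for illustration; not part of vLLM.
    Returns None when the value is unset or empty.
    """
    if value is None or not value.strip():
        return None
    ids = []
    for part in value.split(","):
        part = part.strip()
        if not part.isdigit():
            # Entries such as GPU UUIDs or "-1" are out of scope for this sketch.
            raise ValueError(f"unsupported device entry: {part!r}")
        ids.append(int(part))
    return ids


if __name__ == "__main__":
    print(device_ids_from_env("0, 1"))
    # In a launch script, one might read the legacy variable once and pass
    # the result on, following the form shown in the notes (assumed API):
    #   ids = device_ids_from_env(os.environ.get("CUDA_VISIBLE_DEVICES"))
    #   llm = LLM(..., device_ids=ids)
```

Returning `None` for an unset value lets the caller decide whether to omit the argument entirely rather than pass an empty list.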

✨ New Features

Added support for the new MiniMax-M3 model.
Implemented BF16/FP8 indexer via MSA, MXFP4 support, FP8 sparse GQA, and extensive AMD/ROCm tuning for MiniMax-M3.
Model Runner V2 (MRv2) now supports quantized models by default.
Model Runner V2 (MRv2) now enables GraniteMoE by default.
Introduced a new Streaming Parser Engine to unify tool-call/reasoning parsing across models (supporting Qwen3, MiniMax-M2, GLM-4.7/5.1/5.2, Nemotron V3).
Added support for DiffusionGemma, including a CPU path and structured-output guardrails for diffusion decoders.
Integrated DeepEP v2 for expert parallelism.
The Rust frontend gained API-key authentication, CORS support, `/tokenize` and `/detokenize` endpoints, control endpoints (`/pause`, `/resume`, `/is_paused`, `/abort_requests`), and `/get_world_size`.
Added `thinking_token_budget` support to the Rust frontend.
Added a Python bridge for Rust tool parsers.
Introduced the `device_ids` argument for explicit device selection, replacing internal manipulation of `CUDA_VISIBLE_DEVICES`.
DeepSeek-V4 optimizations include FlashInfer sparse index cache, prefill chunk-planning optimization, cluster-cooperative topK kernel, contiguous per-block KV allocations, and native DSA indexer decode on SM100.
Enabled DeepSeek-V4 on SM120 alongside GLM-5.1.
MRv2 gained migration support for Qwen + DeepSeek-V2 MoE models and DFlash speculative decoding.
Added more accurate FP32 Gumbel sampling in MRv2.
KV cache watermark introduced to reduce preemptions.
Two-phase allocation implemented for cross-group prefix-cache hits.
Marconi-style admission policy for hybrid cache.
Prefix-cache retention added for Mamba/linear attention.
Fastsafetensors ParallelLoader implemented for weight loading.
Support for releasing cached device memory under pressure on UMA GPUs.
Structured outputs added for beam search.
Graceful fallback implemented when `numactl --membind` is blocked.
Config-class registration moved before tokenizer initialization.
Async scheduling implemented with prompt embeds for multimodal models.
DeepEP v2 integration includes token-bound and topk-index fixes.
NIXL EP enhancements: DBO, top-k index dtype query, and NVFP4 post-receive quantization skip.
Elastic-EP communicator added.
KV push from prefill to decode via NIXL.
Per-region KV transfer classification for mixed full-attn + MLA groups.
Mooncake pipeline-parallel PD support, async lookup, compact chunk-hash zero-copy lookup, and SWA-block skipping.
Multi-tier async batched lookup for KV offloading.
Packed HMA KV-cache layout implemented (gated).
Parallel-agnostic fs-tier cache for KV offloading.
SM90 CUTLASS FP8 mm odd-M support via swap_ab, resulting in significant kernel speedup.
Tuned `fused_moe` FP8 for Qwen3-Next-80B on H100 (+25%).
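
For readers unfamiliar with the Gumbel sampling mentioned in the MRv2 item above, the underlying Gumbel-max trick can be sketched in a few lines. This is a generic illustration, not MRv2's implementation: the function name `gumbel_max_sample` is hypothetical, and the code uses Python's double-precision floats, so it shows the technique rather than the FP32 precision behavior the release refers to.

```python
import math
import random


def gumbel_max_sample(logits, temperature=1.0, rng=random):
    """Draw one index from softmax(logits / temperature) via the Gumbel-max trick.

    Adding independent Gumbel(0, 1) noise to each scaled logit and taking the
    argmax is equivalent to sampling from the softmax distribution.
    """
    best_index, best_score = -1, -math.inf
    for i, logit in enumerate(logits):
        # rng.random() lies in [0.0, 1.0); clamp away from 0 so log() is defined.
        u = max(rng.random(), 1e-20)
        gumbel_noise = -math.log(-math.log(u))
        score = logit / temperature + gumbel_noise
        if score > best_score:
            best_index, best_score = i, score
    return best_index


if __name__ == "__main__":
    rng = random.Random(0)
    # Logits [0, ln 3] correspond to probabilities [0.25, 0.75].
    draws = [gumbel_max_sample([0.0, math.log(3.0)], rng=rng) for _ in range(10000)]
    print(sum(draws) / len(draws))
```

Because the noise enters through two nested logarithms, the precision of the arithmetic affects the tails of the sampled distribution, which is presumably why a higher-precision path matters in a real sampler.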

🐛 Bug Fixes

Fixed a MiniMax-M2 performance regression.
Fixed FP8 KV-cache issue related to MiniMax-M3.
Fixed OOM issue for DeepSeek-V4.
Fixed MTP projection prefixing for DeepSeek-V4.
Fixed KV-cache dtypes support for DeepSeek-V4.
Fixed Qwen3.5 EP weight-loading issue.
Fixed Llama4 weight loading and streamed loading to prevent host-OOM.
Fixed MiMo v2.x QKV TP sharding + FP4 implementation.
Fixed ColQwen3.5 retrieval correctness.
Fixed MiDashengLM TP>1 audio-encoder crash.
Fixed device-placement and image-size issues for MiniCPM-o/V.
Fixed Cohere2 MoE weight loading + parser issues.
Fixed GLM-5 TRT-LLM ragged MLA prefill dimensions.
Fixed `min_tokens` off-by-one error in the V2 GPU sampler.
Fixed LoRA warmup.
Fixed race condition in async accepted counts for speculative decoding.
Fixed FlashMLA sparse accuracy.
Fixed one-shot fused all-reduce PDL NaN issue in distributed core.
Fixed numerous correctness/race issues in KV offloading.

⚡ Deprecations

The internal setting of `CUDA_VISIBLE_DEVICES` is deprecated; use the `device_ids` argument instead.
On ROCm, the use of `CUDA_VISIBLE_DEVICES` is entering a deprecation window.