v0.24.0
📦 vllm
⚠ 2 breaking · ✨ 36 features · 🐛 19 fixes · ⚡ 2 deprecations · 🔧 41 symbols
Summary
v0.24.0 introduces extensive support and performance optimizations for new models like MiniMax-M3 and DeepSeek-V4, matures the Model Runner V2 with default quantization support, and overhauls device selection by removing internal use of CUDA_VISIBLE_DEVICES.
⚠️ Breaking Changes
- vLLM no longer sets the `CUDA_VISIBLE_DEVICES` environment variable internally. Specify target devices explicitly with the new `device_ids` argument when initializing the engine or API server.
- On ROCm platforms, using `CUDA_VISIBLE_DEVICES` is now deprecated ahead of a future removal. Transition to the `device_ids` argument.
Migration Steps
- Where you relied on vLLM setting `CUDA_VISIBLE_DEVICES` internally, pass the explicit `device_ids` argument when initializing vLLM components (e.g., `LLM(..., device_ids=[0, 1])`).
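As a migration aid, the sketch below turns a `CUDA_VISIBLE_DEVICES`-style string (for example, one stored in a launch script or config file) into the integer list that `device_ids` expects. The helper name `parse_device_ids` is hypothetical and not part of vLLM. It assumes plain integer indices only, and the release notes do not say how `device_ids` interacts with a `CUDA_VISIBLE_DEVICES` value set outside vLLM, so treat this as an illustration rather than a drop-in fix.

```python
from typing import List, Optional


def parse_device_ids(value: Optional[str]) -> Optional[List[int]]:
    """Convert a "0,1"-style device string into a list of integer device IDs.

    Hypothetical helper for illustration; not part of vLLM.
    Returns None for an unset or empty value so the caller can omit the argument.
    """
    if value is None or not value.strip():
        return None
    ids: List[int] = []
    for part in value.split(","):
        part = part.strip()
        if not part.isdigit():
            # GPU UUIDs and MIG identifiers are deliberately not handled here.
            raise ValueError(f"unsupported device entry: {part!r}")
        ids.append(int(part))
    return ids


if __name__ == "__main__":
    ids = parse_device_ids("0,1")
    print(ids)
    # Per the migration step above, the list is then passed explicitly, e.g.:
    #   from vllm import LLM
    #   llm = LLM(model="<your-model>", device_ids=ids)
```

Raising on non-integer entries keeps the sketch from silently guessing a mapping for identifiers it does not understand.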
✨ New Features
- Added support for the new MiniMax-M3 model.
- Implemented BF16/FP8 indexer via MSA, MXFP4 support, FP8 sparse GQA, and extensive AMD/ROCm tuning for MiniMax-M3.
- Model Runner V2 (MRv2) now supports quantized models by default.
- Model Runner V2 (MRv2) now enables GraniteMoE by default.
- Introduced a new Streaming Parser Engine to unify tool-call/reasoning parsing across models (supporting Qwen3, MiniMax-M2, GLM-4.7/5.1/5.2, Nemotron V3).
- Added support for DiffusionGemma, including a CPU path and structured-output guardrails for diffusion decoders.
- Integrated DeepEP v2 for expert parallelism.
- Rust frontend added API-key authentication, CORS support, `/tokenize` + `/detokenize`, control endpoints (`/pause`, `/resume`, `/is_paused`, `/abort_requests`), and `/get_world_size`.
- Added `thinking_token_budget` support to the Rust frontend.
- Added a Python bridge for Rust tool parsers.
- Introduced the `device_ids` argument for explicit device selection, replacing internal manipulation of `CUDA_VISIBLE_DEVICES`.
- DeepSeek-V4 optimizations include FlashInfer sparse index cache, prefill chunk-planning optimization, cluster-cooperative topK kernel, contiguous per-block KV allocations, and native DSA indexer decode on SM100.
- Enabled DeepSeek-V4 on SM120 alongside GLM-5.1.
- MRv2 gained migration support for Qwen + DeepSeek-V2 MoE models and DFlash speculative decoding.
- Added more accurate FP32 Gumbel sampling in MRv2.
- KV cache watermark introduced to reduce preemptions.
- Two-phase allocation implemented for cross-group prefix-cache hits.
- Added a Marconi-style admission policy for the hybrid cache.
- Prefix-cache retention added for Mamba/linear attention.
- Fastsafetensors ParallelLoader implemented for weight loading.
- Added support for releasing cached device memory under pressure on UMA GPUs.
- Structured outputs added for beam search.
- Graceful fallback implemented when `numactl --membind` is blocked.
- Config-class registration moved before tokenizer initialization.
- Async scheduling implemented with prompt embeds for multimodal models.
- DeepEP v2 integration includes token-bound and topk-index fixes.
- NIXL EP enhancements: DBO, top-k index dtype query, and NVFP4 post-receive quantization skip.
- Elastic-EP communicator added.
- Added KV push from prefill to decode via NIXL.
- Added per-region KV transfer classification for mixed full-attn + MLA groups.
- Added Mooncake pipeline-parallel PD support, async lookup, compact chunk-hash zero-copy lookup, and SWA-block skipping.
- Added multi-tier async batched lookup for KV offloading.
- Packed HMA KV-cache layout implemented (gated).
- Added a parallel-agnostic fs-tier cache for KV offloading.
- Added odd-M support to the SM90 CUTLASS FP8 mm kernel via `swap_ab`, which significantly speeds up the kernel.
- Tuned `fused_moe` FP8 for Qwen3-Next-80B on H100 (+25%).
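For readers unfamiliar with the Gumbel sampling mentioned in the MRv2 item above, here is a minimal sketch of the general Gumbel-max trick in plain Python. It is not vLLM's kernel, and the function name `gumbel_max_sample` is hypothetical. Python floats are double precision, so the sketch shows only the technique and does not reproduce any FP32-specific behavior.

```python
import math
import random
from typing import List, Optional


def gumbel_max_sample(logits: List[float], rng: Optional[random.Random] = None) -> int:
    """Sample an index with probability softmax(logits) using the Gumbel-max trick.

    Hypothetical illustration; not vLLM's implementation.
    """
    if not logits:
        raise ValueError("logits must be non-empty")
    rng = rng or random.Random()
    best_index, best_score = 0, -math.inf
    for i, logit in enumerate(logits):
        # random() returns a value in [0, 1); clamp away from 0 so log(u) is defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        score = logit + gumbel_noise
        if score > best_score:
            best_index, best_score = i, score
    return best_index


if __name__ == "__main__":
    # Logits 0 and log(3) correspond to probabilities 0.25 and 0.75.
    rng = random.Random(0)
    draws = [gumbel_max_sample([0.0, math.log(3.0)], rng) for _ in range(10000)]
    print(sum(draws) / len(draws))
```

Taking the argmax of logits plus independent Gumbel noise avoids computing a normalized softmax, which is one reason the trick is common in GPU samplers. The noise term involves a double logarithm, so numeric precision plausibly matters for rare tokens.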
🐛 Bug Fixes
- Fixed a MiniMax-M2 performance regression.
- Fixed FP8 KV-cache issue related to MiniMax-M3.
- Fixed OOM issue for DeepSeek-V4.
- Fixed MTP projection prefixing for DeepSeek-V4.
- Fixed KV-cache dtypes support for DeepSeek-V4.
- Fixed Qwen3.5 EP weight-loading issue.
- Fixed Llama4 weight loading and streamed loading to prevent host-OOM.
- Fixed MiMo v2.x QKV TP sharding + FP4 implementation.
- Fixed ColQwen3.5 retrieval correctness.
- Fixed MiDashengLM TP>1 audio-encoder crash.
- Fixed device-placement and image-size issues for MiniCPM-o/V.
- Fixed Cohere2 MoE weight loading + parser issues.
- Fixed GLM-5 TRT-LLM ragged MLA prefill dimensions.
- Fixed `min_tokens` off-by-one error in the V2 GPU sampler.
- Fixed LoRA warmup.
- Fixed race condition in async accepted counts for speculative decoding.
- Fixed FlashMLA sparse accuracy.
- Fixed one-shot fused all-reduce PDL NaN issue in distributed core.
- Fixed numerous correctness/race issues in KV offloading.
Affected Symbols
MiniMax-M3, DeepSeek-V4, Model Runner V2 (MRv2), GraniteMoE, Qwen, DeepSeek-V2 MoE, Streaming Parser Engine, Qwen3, MiniMax-M2, GLM-4.7, GLM-5.1, GLM-5.2, Nemotron V3, DiffusionGemma, DeepEP v2, CUDA_VISIBLE_DEVICES, device_ids, Gemma 4, FlashAttention (FA4), Qwen3-VL, Qwen2-VL, Qwen2.5-VL, Qwen3.5, GLM-4.1V, DeepSeek-OCR, Kimi-VL, mllama4, Lfm2VL, Llama4, MiMo v2.x, ColQwen3.5, EXAONE-4.5, MiDashengLM, MiniCPM-o/V, Cohere2 MoE, ColBERT AutoWeightsLoader, GLM-5, NIXL EP, P2pNcclConnector, Mamba, DeepGEMM
⚡ Deprecations
- The internal setting of `CUDA_VISIBLE_DEVICES` is deprecated; use the `device_ids` argument instead.
- On ROCm, the use of `CUDA_VISIBLE_DEVICES` is entering a deprecation window.