v0.24.0

📦 vllm
2 breaking · 36 features · 🐛 19 fixes · 2 deprecations · 🔧 41 symbols

Summary

v0.24.0 adds support for the new MiniMax-M3 model, ships performance optimizations for MiniMax-M3 and DeepSeek-V4, matures Model Runner V2 (MRv2) with quantized models supported by default, and overhauls device selection: vLLM no longer sets `CUDA_VISIBLE_DEVICES` internally and instead takes an explicit `device_ids` argument.

⚠️ Breaking Changes

  • vLLM no longer sets the `CUDA_VISIBLE_DEVICES` environment variable internally. Users must now specify target devices explicitly with the new `device_ids` argument when initializing the engine or API server.
  • On ROCm platforms, `CUDA_VISIBLE_DEVICES` is now deprecated and is slated for removal in a future release. ROCm users should switch to the `device_ids` argument.

Migration Steps

  1. Replace any reliance on vLLM setting `CUDA_VISIBLE_DEVICES` internally with the explicit `device_ids` argument when initializing vLLM components (e.g., `LLM(..., device_ids=[0, 1])`).
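
For deployments that already export `CUDA_VISIBLE_DEVICES`, one way to carry the existing value over is to parse it into a list of integer indices and pass that list as `device_ids`. The sketch below is illustrative only: `device_ids_from_env` is a hypothetical helper, not part of vLLM; it handles plain integer indices and rejects GPU UUIDs or MIG identifiers; and the commented-out `LLM(...)` call assumes the argument is accepted exactly as named in these notes.

```python
import os


def device_ids_from_env(var="CUDA_VISIBLE_DEVICES", environ=None):
    """Hypothetical helper (not a vLLM API): parse integer device indices.

    Returns None when the variable is unset, [] when it is set but empty,
    and a list of ints otherwise. Non-integer entries such as GPU UUIDs
    are rejected rather than guessed at.
    """
    env = os.environ if environ is None else environ
    if var not in env:
        return None  # nothing exported; the caller chooses a default
    raw = env[var].strip()
    if raw == "":
        return []  # an empty value means no devices were made visible
    ids = []
    for token in raw.split(","):
        token = token.strip()
        if not token.isdecimal():
            raise ValueError(
                f"unsupported device entry {token!r}; only integer indices are handled"
            )
        ids.append(int(token))
    return ids


# Assumed usage, following the example in the migration step above:
#   from vllm import LLM
#   ids = device_ids_from_env()
#   llm = LLM(..., device_ids=ids)
```

Returning `None` for an unset variable keeps the decision about a default device set with the caller instead of hiding it inside the helper.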

✨ New Features

  • Added support for the new MiniMax-M3 model.
  • Implemented BF16/FP8 indexer via MSA, MXFP4 support, FP8 sparse GQA, and extensive AMD/ROCm tuning for MiniMax-M3.
  • Model Runner V2 (MRv2) now supports quantized models by default.
  • Model Runner V2 (MRv2) now enables GraniteMoE by default.
  • Introduced a new Streaming Parser Engine to unify tool-call/reasoning parsing across models (supporting Qwen3, MiniMax-M2, GLM-4.7/5.1/5.2, Nemotron V3).
  • Added support for DiffusionGemma, including a CPU path and structured-output guardrails for diffusion decoders.
  • Integrated DeepEP v2 for expert parallelism.
  • The Rust frontend now supports API-key authentication and CORS, and adds the `/tokenize` and `/detokenize` endpoints, the control endpoints (`/pause`, `/resume`, `/is_paused`, `/abort_requests`), and `/get_world_size`.
  • Added `thinking_token_budget` support to the Rust frontend.
  • Added a Python bridge for Rust tool parsers.
  • Introduced the `device_ids` argument for explicit device selection, replacing internal manipulation of `CUDA_VISIBLE_DEVICES`.
  • DeepSeek-V4 optimizations include FlashInfer sparse index cache, prefill chunk-planning optimization, cluster-cooperative topK kernel, contiguous per-block KV allocations, and native DSA indexer decode on SM100.
  • Enabled DeepSeek-V4 on SM120 alongside GLM-5.1.
  • MRv2 gained migration support for Qwen + DeepSeek-V2 MoE models and DFlash speculative decoding.
  • Added more accurate FP32 Gumbel sampling in MRv2.
  • KV cache watermark introduced to reduce preemptions.
  • Two-phase allocation implemented for cross-group prefix-cache hits.
  • Marconi-style admission policy for hybrid cache.
  • Prefix-cache retention added for Mamba/linear attention.
  • Fastsafetensors ParallelLoader implemented for weight loading.
  • Support for releasing cached device memory under pressure on UMA GPUs.
  • Structured outputs added for beam search.
  • Graceful fallback implemented when `numactl --membind` is blocked.
  • Config-class registration moved before tokenizer initialization.
  • Async scheduling implemented with prompt embeds for multimodal models.
  • DeepEP v2 integration includes token-bound and topk-index fixes.
  • NIXL EP enhancements: DBO, top-k index dtype query, and NVFP4 post-receive quantization skip.
  • Elastic-EP communicator added.
  • KV push from prefill to decode via NIXL.
  • Per-region KV transfer classification for mixed full-attn + MLA groups.
  • Mooncake pipeline-parallel PD support, async lookup, compact chunk-hash zero-copy lookup, and SWA-block skipping.
  • Multi-tier async batched lookup for KV offloading.
  • Packed HMA KV-cache layout implemented (gated).
  • Parallel-agnostic fs-tier cache for KV offloading.
  • SM90 CUTLASS FP8 mm odd-M support via swap_ab, resulting in significant kernel speedup.
  • Tuned `fused_moe` FP8 for Qwen3-Next-80B on H100 (+25%).
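
As background for the FP32 Gumbel sampling item in the list above: Gumbel sampling usually refers to the Gumbel-max trick, in which independent Gumbel noise `-log(-log(u))` is added to each logit and the argmax is taken, which draws an index with probability `softmax(logits)`. The noise term is sensitive to rounding of `u` near 0 and 1, which is presumably why the precision of this step matters. The sketch below only illustrates the trick in pure Python (double-precision floats); it is not vLLM's implementation, and `gumbel_max_sample` is a hypothetical name.

```python
import math
import random


def gumbel_max_sample(logits, rng=random):
    """Illustrative Gumbel-max sampler (hypothetical, not vLLM code).

    Returns an index i drawn with probability softmax(logits)[i].
    """
    if not logits:
        raise ValueError("logits must be non-empty")
    best_index, best_score = -1, -math.inf
    for i, logit in enumerate(logits):
        u = rng.random()
        while u == 0.0:  # random() may return 0.0; avoid log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))  # standard Gumbel(0, 1) noise
        score = logit + gumbel
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Example: draw 5 token indices from a 3-way distribution.
rng = random.Random(0)
samples = [gumbel_max_sample([2.0, 1.0, 0.1], rng) for _ in range(5)]
```

Passing a seeded `random.Random` instance makes the draws reproducible, which helps when comparing sampler variants.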

🐛 Bug Fixes

  • Fixed a MiniMax-M2 performance regression.
  • Fixed FP8 KV-cache issue related to MiniMax-M3.
  • Fixed OOM issue for DeepSeek-V4.
  • Fixed MTP projection prefixing for DeepSeek-V4.
  • Fixed KV-cache dtypes support for DeepSeek-V4.
  • Fixed Qwen3.5 EP weight-loading issue.
  • Fixed Llama4 weight loading and streamed loading to prevent host-OOM.
  • Fixed MiMo v2.x QKV TP sharding + FP4 implementation.
  • Fixed ColQwen3.5 retrieval correctness.
  • Fixed MiDashengLM TP>1 audio-encoder crash.
  • Fixed device-placement and image-size issues for MiniCPM-o/V.
  • Fixed Cohere2 MoE weight loading + parser issues.
  • Fixed GLM-5 TRT-LLM ragged MLA prefill dimensions.
  • Fixed `min_tokens` off-by-one error in the V2 GPU sampler.
  • Fixed LoRA warmup.
  • Fixed race condition in async accepted counts for speculative decoding.
  • Fixed FlashMLA sparse accuracy.
  • Fixed one-shot fused all-reduce PDL NaN issue in distributed core.
  • Fixed numerous correctness/race issues in KV offloading.

⚡ Deprecations

  • The internal setting of `CUDA_VISIBLE_DEVICES` is deprecated; use the `device_ids` argument instead.
  • On ROCm, the use of `CUDA_VISIBLE_DEVICES` is entering a deprecation window.