vLLM v0.9.0
Summary
vLLM v0.9.0 upgrades to PyTorch 2.7 and CUDA 12.8, introducing initial NVIDIA Blackwell support and advanced scaling features such as Expert and Data Parallelism. It also includes significant model expansions, a migration to MkDocs, and a shift to deterministic defaults for the V1 engine.
⚠️ Breaking Changes
- Upgraded to PyTorch 2.7, which changes environment dependencies.
- Removal of CUDA 12.4 support; the default wheel is now CUDA 12.8.
- The V1 engine now defaults to seed=0, making outputs deterministic across runs even when temperature > 0.
- Changed top_k behavior: top-k sampling is now disabled with 0 (the old value -1 is still temporarily accepted); see the sketch below.
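A minimal sketch of how these two defaults surface in the Python API, assuming the standard LLM/SamplingParams entry points; the model name is illustrative:

```python
from vllm import LLM, SamplingParams

# top_k=0 now disables top-k sampling; -1 is still accepted but deprecated.
params = SamplingParams(temperature=0.8, top_k=0)

# The V1 engine defaults to seed=0, so repeated runs are deterministic
# even with temperature > 0; pass an explicit seed to control this yourself.
llm = LLM(model="facebook/opt-125m", seed=1234)  # illustrative model
outputs = llm.generate(["The future of inference is"], params)
print(outputs[0].outputs[0].text)
```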
Migration Steps
- Upgrade environment to support PyTorch 2.7.
- Update CUDA drivers to support CUDA 12.8 (or use the GitHub artifact for CUDA 12.6).
- If using Falcon-H1, install the development version of transformers from source.
- Update configurations that disable top-k sampling from top_k=-1 to top_k=0.
- For Blackwell performance, install the FlashInfer nightly wheel and set VLLM_ATTENTION_BACKEND=FLASHINFER (see the sketch after this list).
- Review random-seed handling if you rely on non-deterministic outputs in the V1 engine (the default seed is now 0).
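A hedged sketch of the Blackwell setup step; it assumes the FlashInfer nightly wheel is already installed and sets the backend before the engine is created:

```python
import os

# Select the FlashInfer attention backend; set this before vLLM
# initializes the engine so the selection takes effect.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM  # import after setting the env var

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
```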
✨ New Features
- Initial support for NVIDIA Blackwell with optimized attention and MLP kernels.
- Initial DP (Data Parallelism), EP (Expert Parallelism), and PD (Prefill-Decode disaggregation) support for large-scale inference.
- Support for new models: MiMo-7B, MiniMax-VL-01, Ovis 1.6/2, GraniteMoeHybrid 4.0, FalconH1, LlamaGuard4, and Qwen2.5-1M.
- New /classify endpoint and truncation control for embedding models (see the example after this list).
- Support for full CUDA graphs in the V1 engine.
- Added the VLLM_ALLOW_INSECURE_SERIALIZATION environment variable as an explicit opt-in for insecure serialization (see the example after this list).
- Support for Quark MXFP4 and nvidia/DeepSeek-R1-FP4 quantization formats.
- Migrated documentation from Sphinx to MkDocs.
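A sketch of calling the new /classify endpoint on a running server; the payload shape mirrors the other pooling endpoints, but field names should be checked against your server's docs, and the model name and port are illustrative:

```python
import requests

# Assumes a classification model is being served locally via `vllm serve`.
resp = requests.post(
    "http://localhost:8000/classify",
    json={
        "model": "jason9693/Qwen2.5-1.5B-apeach",  # illustrative classifier
        "input": "vLLM v0.9.0 adds a /classify endpoint.",
    },
)
resp.raise_for_status()
print(resp.json())
```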
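A similarly hedged sketch for the new serialization guard: it is an explicit opt-in, and leaving it unset keeps the safer default.

```python
import os

# Explicitly opt in to insecure serialization; leave this unset in
# production to keep the safe default behavior.
os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
```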
🐛 Bug Fixes
- Fixed an image hash collision in certain edge cases.
- Fixed a numel() downcast in fused_layernorm_dynamic_per_token_quant.cu.
- Added a contiguous() call inside the RoPE kernel wrapper.
- Migrated regular-expression handling to the regex library to prevent catastrophic backtracking (ReDoS protection).
- Mitigated side-channel attacks on the prefix cache via cache salting.
🔧 Affected Symbols
- torch
- vllm.multimodal
- LLM.chat
- top_k
- V1 Engine
- FlashInfer
- fused_moe
- PiecewiseBackend
- MultiprocExecutor
⚡ Deprecations
- The --enable-reasoning flag is deprecated.
- top_k value of -1 is deprecated in favor of 0.
- Legacy Python 3.7-era type hints are being updated or removed.