vLLM v0.9.0
Summary
vLLM v0.9.0 upgrades to PyTorch 2.7 and CUDA 12.8, introducing initial NVIDIA Blackwell support and advanced scaling features such as Expert and Data Parallelism. It also includes significant model expansions, a migration to MkDocs, and a shift to deterministic defaults for the V1 engine.
⚠️ Breaking Changes
- Upgraded to PyTorch 2.7, which changes environment dependencies.
- Removal of CUDA 12.4 support; the default wheel is now CUDA 12.8.
- The V1 engine now defaults to seed=0, making outputs deterministic across runs even when temperature > 0.
- Changed top_k behavior: top-k sampling is now disabled with 0 (the old value -1 is still temporarily accepted); see the sketch below.
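A minimal sketch of how these two defaults surface in the Python API, assuming the standard LLM/SamplingParams entry points; the model name is illustrative:

```python
from vllm import LLM, SamplingParams

# top_k=0 now disables top-k sampling; -1 is still accepted but deprecated.
params = SamplingParams(temperature=0.8, top_k=0)

# The V1 engine defaults to seed=0, so repeated runs are deterministic
# even with temperature > 0; pass an explicit seed to control this yourself.
llm = LLM(model="facebook/opt-125m", seed=1234)  # illustrative model
outputs = llm.generate(["The future of inference is"], params)
print(outputs[0].outputs[0].text)
```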
Migration Steps
- Upgrade environment to support PyTorch 2.7.
- Update CUDA drivers to support CUDA 12.8 (or use the GitHub artifact for CUDA 12.6).
- If using Falcon-H1, install the development version of transformers from source.
- Update configurations that disable top-k sampling from top_k=-1 to top_k=0.
- For Blackwell performance, install the FlashInfer nightly wheel and set VLLM_ATTENTION_BACKEND=FLASHINFER (see the sketch after this list).
- Review random-seed handling if you rely on non-deterministic outputs in the V1 engine (the default seed is now 0).
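A hedged sketch of the Blackwell setup step; it assumes the FlashInfer nightly wheel is already installed and sets the backend before the engine is created:

```python
import os

# Select the FlashInfer attention backend; set this before vLLM
# initializes the engine so the selection takes effect.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM  # import after setting the env var

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
```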
✨ New Features
- Initial support for NVIDIA Blackwell with optimized attention and MLP kernels.
- Initial DP (Data Parallelism), EP (Expert Parallelism), and PD (Prefill-Decode disaggregation) support for large-scale inference.
- Support for new models: MiMo-7B, MiniMax-VL-01, Ovis 1.6/2, GraniteMoeHybrid 4.0, FalconH1, LlamaGuard4, and Qwen2.5-1M.
- New /classify endpoint and truncation control for embedding models (see the example after this list).
- Support for full CUDA graphs in the V1 engine.
- Added the VLLM_ALLOW_INSECURE_SERIALIZATION environment variable as an explicit opt-in for insecure serialization (see the example after this list).
- Support for Quark MXFP4 and nvidia/DeepSeek-R1-FP4 quantization formats.
- Migrated documentation from Sphinx to MkDocs.
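A sketch of calling the new /classify endpoint on a running server; the payload shape mirrors the other pooling endpoints, but field names should be checked against your server's docs, and the model name and port are illustrative:

```python
import requests

# Assumes a classification model is being served locally via `vllm serve`.
resp = requests.post(
    "http://localhost:8000/classify",
    json={
        "model": "jason9693/Qwen2.5-1.5B-apeach",  # illustrative classifier
        "input": "vLLM v0.9.0 adds a /classify endpoint.",
    },
)
resp.raise_for_status()
print(resp.json())
```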
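A similarly hedged sketch for the new serialization guard: it is an explicit opt-in, and leaving it unset keeps the safer default.

```python
import os

# Explicitly opt in to insecure serialization; leave this unset in
# production to keep the safe default behavior.
os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
```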
🐛 Bug Fixes
- Fixed an image hash collision in certain edge cases.
- Fixed a numel() downcast in fused_layernorm_dynamic_per_token_quant.cu.
- Added a contiguous() call inside the RoPE kernel wrapper.
- Migrated regular-expression handling to the regex library to prevent catastrophic backtracking (ReDoS protection).
- Mitigated side-channel attacks on the prefix cache via cache salting.
🔧 Affected Symbols
- torch
- vllm.multimodal
- LLM.chat
- top_k
- V1 Engine
- FlashInfer
- fused_moe
- PiecewiseBackend
- MultiprocExecutor
⚡ Deprecations
- The --enable-reasoning flag is deprecated.
- top_k value of -1 is deprecated in favor of 0.
- Legacy Python 3.7-era type hints are being updated or removed.