v0.20.0
📦 vllm
⚠ 2 breaking · ✨ 10 features · 🐛 6 fixes · 🔧 10 symbols
Summary
v0.20.0 introduces major infrastructure upgrades, including a default switch to CUDA 13.0 and PyTorch 2.11, alongside performance work such as the TurboQuant 2-bit KV cache and the re-enabling of FlashAttention 4 as the default MLA prefill backend.
⚠️ Breaking Changes
- Default CUDA wheel switched to CUDA 13.0, following PyTorch's version policy. Users must ensure their environment matches this new default or explicitly build/install for a different CUDA version.
- vLLM now ships on PyTorch 2.11 for CUDA environments. XPU environments temporarily remain on torch-xpu 2.10. This breaks environments relying on older PyTorch versions.
Migration Steps
- If you use the default CUDA build, make sure your PyTorch installation targets CUDA 13.0; on XPU, stay on torch-xpu 2.10 until the backend moves to a newer PyTorch.
- If you depend on behavior from older Transformers releases, note that vLLM now targets transformers>=5, which may shift compatibility.
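One practical migration check is comparing the CUDA toolkit your PyTorch build was compiled against (exposed as `torch.version.cuda`) with the new default wheel's toolkit. The helper below is a minimal sketch, not part of vLLM; it only compares major.minor version components:

```python
def cuda_matches(build_cuda, required="13.0"):
    """Return True if the detected CUDA build string (e.g. the value of
    torch.version.cuda) has the same major.minor as the required toolkit.
    build_cuda is None on CPU-only PyTorch builds."""
    if build_cuda is None:
        return False
    return build_cuda.split(".")[:2] == required.split(".")[:2]

print(cuda_matches("13.0"))    # True: matches the new default wheel
print(cuda_matches("12.8"))    # False: needs an explicit cu128 build
```

In practice you would pass `torch.version.cuda` as the first argument; a mismatch means you need to install a wheel built for your CUDA version rather than the default.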
✨ New Features
- FlashAttention 4 (FA4) re-enabled as the default MLA prefill backend, supporting head-dim 512 and paged-KV on SM90+.
- Introduction of TurboQuant 2-bit KV cache, offering 4x capacity compression.
- New end-to-end online quantization frontend.
- Initial vLLM IR (Intermediate Representation) skeleton with rms_norm op, laying the foundation for future kernel work.
- Significant advances in Model Runner V2, including full-CUDA-graph for Eagle prefill and auto-resolution of cudagraph mode/sizes.
- Introduction of the RayExecutorV2 executor backend.
- Support for ZenCPU / AMD Zen CPU backend via zentorch.
- Initial GDN attention support for Qwen3-Next / Qwen3.5 on Intel XPU.
- Lustre FS checkpoint prefetching enabled by default.
- Opt-in VLLM_MEDIA_CACHE for media URL caching.
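The capacity gain from a 2-bit KV cache comes from bit packing: four 2-bit codes fit in one byte. The toy packer below is an illustration only, not vLLM's actual TurboQuant format (which would also store quantization scales, reducing the headline compression from 8x to the quoted 4x):

```python
def pack_2bit(vals):
    """Pack a list of 2-bit integers (0..3) into bytes, 4 per byte.
    Hypothetical layout for illustration, not TurboQuant's real one."""
    assert len(vals) % 4 == 0
    out = bytearray()
    for i in range(0, len(vals), 4):
        b = 0
        for j, v in enumerate(vals[i:i + 4]):
            b |= (v & 0b11) << (2 * j)  # little-endian within the byte
        out.append(b)
    return bytes(out)

def unpack_2bit(data, n):
    """Recover n 2-bit values from the packed byte string."""
    return [(data[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]

vals = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_2bit(vals)
print(len(packed))                     # 2 bytes for 8 values
assert unpack_2bit(packed, len(vals)) == vals
```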
🐛 Bug Fixes
- Compatibility fixes for Transformers v5, including handling PaddleOCR-VL image processor max_pixels and resolving Mistral YaRN warnings.
- Removal of piecewise-fallback for eagle draft decodes in Model Runner V2.
- Removal of MoE DP chunking in the MoE refactor series.
- Removal of MLA decode output zero-fill in AMD ROCm AITER path.
- Removal of redundant syncs for pooling, yielding ~3.7% throughput improvement.
- Removal of GPU↔CPU syncs in prefill and spec-decode paths for GDN/Mamba.
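Several of the fixes above share one idea: avoid forcing a GPU↔CPU synchronization inside a hot loop, and read results back once instead. The sketch below models that pattern with an invented `FakeDevice` class (real code would operate on CUDA tensors and streams); it only shows why batching the read-back removes per-step stalls:

```python
import time

class FakeDevice:
    """Toy stand-in for a GPU work queue: each read-back pays a fixed
    latency, mimicking the stall of a device-to-host transfer."""
    SYNC_COST = 0.001  # seconds per simulated sync point

    def __init__(self):
        self.buf = []

    def submit(self, x):
        self.buf.append(x * 2)       # "kernel" runs asynchronously

    def read_back(self):
        time.sleep(self.SYNC_COST)   # one simulated sync stall
        out, self.buf = self.buf, []
        return out

# Syncing every step: 8 stalls.
dev = FakeDevice()
eager = []
for _ in range(8):
    dev.submit(1)
    eager.extend(dev.read_back())

# Keeping work on-device and reading back once: 1 stall, same result.
dev = FakeDevice()
for _ in range(8):
    dev.submit(1)
batched = dev.read_back()

assert eager == batched == [2] * 8
```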