v0.20.0

Breaking Changes
📦 vllm
⚠️ 2 breaking · ✨ 10 features · 🐛 6 fixes · 🔧 10 symbols

Summary

v0.20.0 introduces major infrastructure upgrades, including a default switch to CUDA 13.0 and PyTorch 2.11, alongside significant performance enhancements such as the TurboQuant 2-bit KV cache and the re-enabling of FlashAttention 4 as the default MLA prefill backend.

⚠️ Breaking Changes

  • Default CUDA wheel switched to CUDA 13.0, following PyTorch's version policy. Users must ensure their environment matches this new default or explicitly build/install for a different CUDA version.
  • vLLM now ships on PyTorch 2.11 for CUDA environments. XPU environments temporarily remain on torch-xpu 2.10. This breaks environments relying on older PyTorch versions.

Migration Steps

  1. Ensure your environment uses PyTorch compatible with CUDA 13.0 if using the default CUDA build, or manage PyTorch versioning carefully if using XPU.
  2. If you depend on behavior from older Transformers releases, review your integration: vLLM now targets transformers>=5, which may shift compatibility.
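One quick way to verify an environment after upgrading is to compare `torch.version.cuda` against the new default CUDA major version. A minimal sketch — the helper name and the check itself are illustrative, not part of vLLM:

```python
def cuda_major_matches(torch_cuda, required_major=13):
    """Return True when the torch build's CUDA major version matches the
    new default (CUDA 13.x). A CPU-only torch build reports None."""
    if not torch_cuda:
        return False
    return int(torch_cuda.split(".")[0]) == required_major
```

In practice you would call it as `cuda_major_matches(torch.version.cuda)` right after importing torch, and fall back to an explicit build/install for your CUDA version if it returns False.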

✨ New Features

  • FlashAttention 4 (FA4) re-enabled as the default MLA prefill backend, supporting head-dim 512 and paged-KV on SM90+.
  • Introduction of TurboQuant 2-bit KV cache, offering 4x capacity compression.
  • New end-to-end online quantization frontend.
  • Initial vLLM IR (Intermediate Representation) skeleton with rms_norm op, laying the foundation for future kernel work.
  • Significant advances in Model Runner V2, including full-CUDA-graph for Eagle prefill and auto-resolution of cudagraph mode/sizes.
  • Introduction of the RayExecutorV2 execution backend.
  • Support for ZenCPU / AMD Zen CPU backend via zentorch.
  • Initial GDN attention support for Qwen3-Next / Qwen3.5 on Intel XPU.
  • Lustre FS checkpoint prefetching enabled by default.
  • Opt-in VLLM_MEDIA_CACHE for media URL caching.
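Since media URL caching is opt-in, it has to be enabled in the environment before vLLM starts. A minimal sketch — the `VLLM_MEDIA_CACHE` variable name comes from these notes, but the accepted value `"1"` is an assumption; check the documentation for the exact values:

```python
import os

# Opt in to media URL caching before importing/launching vLLM.
# The value "1" is assumed here, not confirmed by the release notes.
os.environ.setdefault("VLLM_MEDIA_CACHE", "1")
```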

🐛 Bug Fixes

  • Compatibility fixes for Transformers v5, including handling PaddleOCR-VL image processor max_pixels and resolving Mistral YaRN warnings.
  • Removal of piecewise-fallback for eagle draft decodes in Model Runner V2.
  • Removal of MoE DP chunking in the MoE refactor series.
  • Removal of MLA decode output zero-fill in AMD ROCm AITER path.
  • Removal of redundant syncs for pooling, yielding ~3.7% throughput improvement.
  • Removal of GPU↔CPU syncs in prefill and spec-decode paths for GDN/Mamba.

Affected Symbols