v0.8.3
⚠ 2 breaking · ✨ 11 features · 🐛 7 fixes · ⚡ 1 deprecation · 🔧 9 symbols
Summary
This release introduces Day 0 support for Llama 4 (V1 engine only) and native sliding window attention. It also features significant performance optimizations for MoE kernels, expanded hardware support for AMD and TPU, and architectural improvements to the V1 engine.
⚠️ Breaking Changes
- Llama 4 support is currently restricted to the V1 engine only.
- The n-gram interface for speculative decoding has been updated, which may require changes to custom implementations.
Migration Steps
- Upgrade huggingface_hub to the minimum required version to enable Xet downloads.
- If using Llama 4, ensure the V1 engine is enabled (see the sketch after this list).
- Update developer environments to Python 3.12 as per the new recommendation.
- Review n-gram interface implementations if using speculative decoding to align with the updated interface.
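A minimal offline sketch of the Llama 4 migration, assuming the VLLM_USE_V1 environment variable controls engine selection in your build; the model ID is illustrative:

```python
import os

# Llama 4 is V1-only in this release; VLLM_USE_V1=1 forces the V1 engine
# (assumption: this environment variable selects the engine in your build).
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# Model ID is illustrative; substitute the Llama 4 checkpoint you actually use.
llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
outputs = llm.generate(
    ["Explain sliding window attention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```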
✨ New Features
- Day 0 Support for Llama 4 Scout and Maverick models.
- Native sliding window attention support in V1 engine with hybrid memory allocator.
- Single-node data parallel (DP) support for the API server (see the first sketch after this list).
- Support for XpYd disaggregated prefill using MooncakeStore.
- New model support: Aya Vision, MiniMaxText01, Skywork-R1V, jina-reranker-v2, and Granite Reasoning Parser.
- V1 engine enhancements: Collective RPC, BitsAndBytes support, and Eagle Proposer for speculative decoding.
- V1 LoRA support for CPU offloading.
- Prefix caching now supports SHA256 (for FIPS compliance) and MD5 hashing (see the second sketch after this list).
- AMD ROCm: Custom all-reduce support and AITER integration for int8 scaled GEMM and fused MoE.
- CPU: Support for Multi-Head Latent Attention (MLA).
- TPU: Optimized all-reduce performance and sliding window support in paged attention.
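A hedged sketch of single-node data parallelism for the API server. The --data-parallel-size flag name is an assumption mirrored from the engine's data_parallel_size argument; verify it against `vllm serve --help` in your installation:

```python
import subprocess

# Launch the OpenAI-compatible API server with two data-parallel replicas
# on one node. Flag name is an assumption; check `vllm serve --help`.
subprocess.run([
    "vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct",
    "--data-parallel-size", "2",
])
```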
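And a sketch of FIPS-friendly prefix caching, assuming the cache config exposes a prefix_caching_hash_algo option; the keyword name is an assumption, so check your build's engine arguments before relying on it:

```python
from vllm import LLM

# Enable prefix caching with a SHA256 content hash instead of the builtin
# Python hash. The prefix_caching_hash_algo keyword is an assumption
# mirrored from the cache config; verify it exists in your version.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
    prefix_caching_hash_algo="sha256",
)
```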
🐛 Bug Fixes
- Fixed CUDA kernel index data type in gptq_marlin/awq_marlin_repack.cu.
- Fixed nightly MLA failure where FA2 + MLA chunked prefill produced incorrect results in V1.
- Fixed inductor cache issues related to max_position_embeddings.
- Fixed TPU V1 multiprocess profiler and sampler recompilation issues.
- Resolved conflicting macro names for GGUF kernels.
- Fixed weight loading for specific models in the Transformers backend.
- Added workaround for shared field_names in Pydantic model classes.
🔧 Affected Symbols
vllm.v1.engine, MooncakeStore, SchedulerInterface, TransformersModel, pydantic.v1, vllm.csrc.quantization.gptq_marlin, AITER, DeepGemm, xgrammar
⚡ Deprecations
- Python 3.12 is now the recommended version for development; developing with older Python versions is discouraged.