
vLLM v0.8.3

📦 vllm
2 breaking changes · 11 features · 7 bug fixes · 1 deprecation · 9 affected symbols

Summary

This release introduces Day 0 support for Llama 4 (V1 engine only) and native sliding window attention. It also features significant performance optimizations for MoE kernels, expanded hardware support for AMD and TPU, and architectural improvements to the V1 engine.

⚠️ Breaking Changes

  • Llama 4 support is currently restricted to the V1 engine only.
  • The n-gram interface for speculative decoding has been updated, which may require changes to custom implementations (see the sketch following this list).
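
Custom n-gram setups can usually be adapted at configuration time. A minimal sketch, assuming the dict-style speculative_config accepted by LLM; the field names below are illustrative assumptions and should be checked against this release's documentation:

```python
from vllm import LLM

# Sketch of n-gram speculative decoding after the interface update.
# The speculative_config keys are assumptions, not the confirmed interface.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",            # draft-model-free n-gram proposer
        "num_speculative_tokens": 5,  # tokens proposed per decoding step
        "prompt_lookup_max": 4,       # largest n-gram matched in the prompt
    },
)
print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```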

Migration Steps

  1. Upgrade huggingface_hub to the minimum required version to enable Xet downloads.
  2. If using Llama 4, ensure the V1 engine is enabled (see the sketch after this list).
  3. Update developer environments to Python 3.12 as per the new recommendation.
  4. Review n-gram interface implementations if using speculative decoding to align with the updated interface.
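
Steps 1 and 2 can be scripted. Step 1 is a package upgrade (`pip install -U huggingface_hub`); the sketch below covers step 2, assuming the VLLM_USE_V1 environment toggle. The model name is illustrative; substitute your own Llama 4 checkpoint:

```python
import os

# Llama 4 support is restricted to the V1 engine in this release.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
out = llm.generate(["Hello, Llama 4!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```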

✨ New Features

  • Day 0 support for the Llama 4 Scout and Maverick models.
  • Native sliding window attention support in V1 engine with hybrid memory allocator.
  • Single node data parallel (DP) support for API server.
  • Support for XpYd disaggregated prefill using MooncakeStore.
  • New model support: Aya Vision, MiniMaxText01, Skywork-R1V, jina-reranker-v2, and Granite Reasoning Parser.
  • V1 engine enhancements: Collective RPC, BitsAndBytes support, and Eagle Proposer for speculative decoding.
  • V1 LoRA support for CPU offloading.
  • Prefix caching now supports SHA256 and MD5 hashing (for FIPS compliance); a configuration sketch follows this list.
  • AMD ROCm: Custom allreduce support and AITER integration for INT8 scaled GEMM and fused MoE.
  • CPU: Support for Multi-Head Latent Attention (MLA).
  • TPU: Optimized all-reduce performance and sliding window support in paged attention.
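
The prefix-cache hash function noted above can be selected when the engine is constructed. A minimal sketch; the prefix_caching_hash_algo argument name is an assumption based on the feature description, so verify it against the engine arguments before relying on it:

```python
from vllm import LLM

# Sketch: enable prefix caching with a FIPS-friendly hash function.
# The prefix_caching_hash_algo keyword is an assumption; check
# `vllm serve --help` or the EngineArgs docs for the exact name and values.
llm = LLM(
    model="facebook/opt-125m",
    enable_prefix_caching=True,
    prefix_caching_hash_algo="sha256",
)
```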

🐛 Bug Fixes

  • Fixed CUDA kernel index data type in gptq_marlin/awq_marlin_repack.cu.
  • Fixed nightly MLA failure where FA2 + MLA chunked prefill produced incorrect results in V1.
  • Fixed inductor cache issues on max_position_embeddings.
  • Fixed TPU V1 multiprocess profiler and sampler recompilation issues.
  • Resolved conflicting macro names for GGUF kernels.
  • Fixed weight loading for specific models in the Transformers backend.
  • Added workaround for shared field_names in Pydantic model classes.

🔧 Affected Symbols

  • vllm.v1.engine
  • MooncakeStore
  • SchedulerInterface
  • TransformersModel
  • pydantic.v1
  • vllm.csrc.quantization.gptq_marlin
  • AITER
  • DeepGemm
  • xgrammar

⚡ Deprecations

  • Development with Python versions older than 3.12 is now discouraged; Python 3.12 is the recommended version.