vLLM v0.8.4
⚠ 3 breaking · ✨ 10 features · 🐛 7 fixes · 🔧 9 symbols
Summary
This release introduces support for Llama4 and Qwen3 models, alongside significant performance optimizations for DeepSeek MLA and MoE kernels. It also advances the V1 engine, enabling multi-input support and structured outputs by default.
⚠️ Breaking Changes
- The default structured output backend is now set to 'auto' in the V1 engine.
- Multi-input support is now enabled by default in the V1 engine.
- The default 'max_num_seqs' for the V1 engine has been reverted to V0 values for most hardware platforms.
Migration Steps
- If using Llama4, update immediately to apply critical accuracy fixes.
- Review custom structured output configurations, as the default backend is now 'auto'; to keep the previous behavior, pin the backend explicitly (see the sketch after this list).
- Check memory settings, as max-model-len estimation now uses available KV cache memory.
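To pin the structured-output backend rather than accept automatic selection, pass it explicitly. A minimal sketch, assuming the pre-existing guided_decoding_backend engine argument; the model name is illustrative:

```python
# Sketch: pin the structured-output backend instead of relying on the
# new 'auto' default in V1. `guided_decoding_backend` is assumed to be
# the existing engine argument; the model name is illustrative.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    guided_decoding_backend="xgrammar",  # or "outlines"; omit to accept "auto"
)
```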
✨ New Features
- Support for Llama4 models with accuracy fixes and performance enhancements.
- Support for new models: Qwen3, Qwen3MoE, SmolVLM, jina-embeddings-v3, InternVL3, and GLM-4-0414.
- Added support for TorchAO quantization (sketch after this list).
- Support for matryoshka representations, with a dimensions parameter in the embeddings API (sketch after this list).
- Enabled regex support with xgrammar in the V0 engine.
- DeepSeek MLA optimization with a new merge_attn_states CUDA kernel (3x speedup).
- Intel-Gaudi: Multi-step scheduling implementation for HPU.
- TPU: Support for torch.compile via XLA backend.
- V1 Engine: Zero-copy tensor/ndarray serialization and Eagle Model loading support.
- Added hf_token to EngineArgs for authenticated Hugging Face downloads (sketch after this list).
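Several of these features lend themselves to short sketches. First, matryoshka embeddings through the OpenAI-compatible server; this assumes a vLLM instance is already serving a matryoshka-capable model, and the model name is illustrative:

```python
# Sketch: request a truncated matryoshka embedding via the `dimensions`
# parameter of the OpenAI-compatible embeddings API. Assumes `vllm serve`
# is running on localhost:8000 with a matryoshka-capable model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",  # illustrative model
    input="vLLM v0.8.4 adds matryoshka embedding support",
    dimensions=256,  # truncate the representation to 256 dims
)
print(len(resp.data[0].embedding))  # expected: 256
```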
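For TorchAO, a sketch under the assumption that the existing quantization engine argument accepts a "torchao" value as of this release:

```python
# Sketch: load a model with TorchAO quantization. Assumes the standard
# `quantization` engine argument gained a "torchao" choice in this release.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",  # illustrative model
    quantization="torchao",
)
```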
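Finally, the new hf_token field on EngineArgs for gated model downloads; reading the token from the environment here is illustrative:

```python
# Sketch: authenticate Hugging Face downloads via the new `hf_token`
# engine argument. Reading the token from the environment is illustrative.
import os

from vllm.engine.arg_utils import EngineArgs

args = EngineArgs(
    model="meta-llama/Llama-3.2-1B-Instruct",  # illustrative gated model
    hf_token=os.environ.get("HF_TOKEN"),
)
```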
🐛 Bug Fixes
- Fixed Llama4 qknorm sharing across heads.
- Fixed Index Error when single request is near max context in Llama4.
- Fixed LoRA kernel processing order.
- Fixed marlin kernel use_atomic_add support when using the V1 engine.
- Fixed tool chat templates for Llama 3.2 and toolace.
- Fixed guidance backend for Qwen models.
- Fixed ChatGLMForConditionalGeneration support.
🔧 Affected Symbols
EngineArgs, AutoWeightsLoader, LoadConfig, ParallelConfig, Platform.supports_structured_output, fused_moe_kernel, CompressedTensorsW8A8Fp8MoEMethod, ChatGLMForConditionalGeneration, TeleChat2ForCausalLM