vLLM v0.8.4
⚠ 3 breaking · ✨ 10 features · 🐛 7 fixes · 🔧 9 symbols
Summary
This release introduces support for Llama4 and Qwen3 models, alongside significant performance optimizations for DeepSeek MLA and MoE kernels. It also advances the V1 engine, enabling multi-input support and structured outputs by default.
⚠️ Breaking Changes
- The default structured output backend is now set to 'auto' in the V1 engine.
- Multi-input support is now enabled by default in the V1 engine.
- The default 'max_num_seqs' for the V1 engine has been reverted to V0 values for most hardware platforms.
Migration Steps
- If using Llama4, update immediately to apply critical accuracy fixes.
- Review custom structured output configurations, as the default backend is now 'auto'; to keep the previous behavior, pin the backend explicitly (see the sketch after this list).
- Check memory settings, as max-model-len estimation now uses available KV cache memory.
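To pin the structured-output backend rather than accept automatic selection, pass it explicitly. A minimal sketch, assuming the pre-existing guided_decoding_backend engine argument; the model name is illustrative:

```python
# Sketch: pin the structured-output backend instead of relying on the
# new 'auto' default in V1. `guided_decoding_backend` is assumed to be
# the existing engine argument; the model name is illustrative.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    guided_decoding_backend="xgrammar",  # or "outlines"; omit to accept "auto"
)
```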
✨ New Features
- Support for Llama4 models with accuracy fixes and performance enhancements.
- Support for new models: Qwen3, Qwen3MoE, SmolVLM, jina-embeddings-v3, InternVL3, and GLM-4-0414.
- Added support for TorchAO quantization (sketch after this list).
- Support for matryoshka representations, with a dimensions parameter in the embeddings API (sketch after this list).
- Enabled regex support with xgrammar in the V0 engine.
- DeepSeek MLA optimization with a new merge_attn_states CUDA kernel (3x speedup).
- Intel-Gaudi: Multi-step scheduling implementation for HPU.
- TPU: Support for torch.compile via XLA backend.
- V1 Engine: Zero-copy tensor/ndarray serialization and Eagle Model loading support.
- Added hf_token to EngineArgs for authenticated Hugging Face downloads (sketch after this list).
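Several of these features lend themselves to short sketches. First, matryoshka embeddings through the OpenAI-compatible server; this assumes a vLLM instance is already serving a matryoshka-capable model, and the model name is illustrative:

```python
# Sketch: request a truncated matryoshka embedding via the `dimensions`
# parameter of the OpenAI-compatible embeddings API. Assumes `vllm serve`
# is running on localhost:8000 with a matryoshka-capable model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",  # illustrative model
    input="vLLM v0.8.4 adds matryoshka embedding support",
    dimensions=256,  # truncate the representation to 256 dims
)
print(len(resp.data[0].embedding))  # expected: 256
```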
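For TorchAO, a sketch under the assumption that the existing quantization engine argument accepts a "torchao" value as of this release:

```python
# Sketch: load a model with TorchAO quantization. Assumes the standard
# `quantization` engine argument gained a "torchao" choice in this release.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",  # illustrative model
    quantization="torchao",
)
```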
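Finally, the new hf_token field on EngineArgs for gated model downloads; reading the token from the environment here is illustrative:

```python
# Sketch: authenticate Hugging Face downloads via the new `hf_token`
# engine argument. Reading the token from the environment is illustrative.
import os

from vllm.engine.arg_utils import EngineArgs

args = EngineArgs(
    model="meta-llama/Llama-3.2-1B-Instruct",  # illustrative gated model
    hf_token=os.environ.get("HF_TOKEN"),
)
```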
🐛 Bug Fixes
- Fixed Llama4 qknorm sharing across heads.
- Fixed Index Error when single request is near max context in Llama4.
- Fixed LoRA kernel processing order.
- Fixed marlin kernel use_atomic_add support when using the V1 engine.
- Fixed tool chat templates for Llama 3.2 and toolace.
- Fixed guidance backend for Qwen models.
- Fixed ChatGLMForConditionalGeneration support.
🔧 Affected Symbols
EngineArgs, AutoWeightsLoader, LoadConfig, ParallelConfig, Platform.supports_structured_output, fused_moe_kernel, CompressedTensorsW8A8Fp8MoEMethod, ChatGLMForConditionalGeneration, TeleChat2ForCausalLM