v0.8.0rc2
📦 vllm — Breaking Changes
⚠ 1 breaking · ✨ 4 features · 🐛 7 fixes · 🔧 8 symbols
Summary
This release focuses on V1 engine refinements, including making MLA the default and removing the input cache client. It also includes critical bug fixes for Ultravox, Mixtral, and ROCm testing environments.
⚠️ Breaking Changes
- The input cache client has been removed from the V1 engine. Users relying on this specific client for caching will need to transition to alternative caching mechanisms provided by the engine.
Migration Steps
- Remove any references to the input cache client from V1 engine configurations.
- If using TPU, ensure your environment is compatible with the updated ragged paged attention kernel which no longer requires padding.
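The first migration step can be sketched as a small config filter. This is illustrative only: the key name `input_cache_client` is an assumption, not vllm's actual configuration field; check your own V1 engine configuration for how the client was referenced.

```python
# Hypothetical key name -- adjust to match your own configuration.
DEPRECATED_KEYS = {"input_cache_client"}

def strip_deprecated(config: dict) -> dict:
    """Return a copy of `config` with keys removed in v0.8.0 dropped."""
    return {k: v for k, v in config.items() if k not in DEPRECATED_KEYS}
```

For example, `strip_deprecated({"input_cache_client": True, "model": "m"})` yields `{"model": "m"}`, which can then be passed to the engine as before.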
✨ New Features
- Added a patch merger utility.
- Added a --seed option to offline multi-modal examples.
- Enabled MLA (Multi-head Latent Attention) by default for the V1 engine.
- Applied ragged paged attention kernel fix and removed padding for TPU on V1.
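The `--seed` option above is typically wired into an example script with `argparse`; the sketch below shows the general shape, not the exact code added to the vllm examples.

```python
import argparse

# Minimal sketch: a --seed flag for reproducible runs. Defaulting to
# None keeps the previous (nondeterministic) behavior when unset.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--seed", type=int, default=None,
    help="Random seed for reproducible sampling (None = nondeterministic)",
)
args = parser.parse_args(["--seed", "42"])
```

Passing the parsed seed through to the sampling configuration then makes repeated runs of an offline multi-modal example reproducible.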
🐛 Bug Fixes
- Fixed Ultravox model support on V1 engine.
- Fixed ROCm tests by using the spawn method for starting new processes.
- Fixed Mixtral model to correctly use the head_dim configuration argument.
- Fixed structured output matcher construction by using vocab_size.
- Restricted Gemma3 multi-modal support to V0 only for the time being.
- Fixed misleading log messages during multi-modal profiling.
- Fixed linting issues (line length) in pixtral.py.
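The ROCm test fix above switches process startup to the spawn method. As a generic sketch (not vLLM's actual test code), the standard library lets you request a spawn context explicitly:

```python
import multiprocessing as mp

# "spawn" starts each child in a fresh interpreter instead of forking,
# so no GPU/runtime state is inherited from the parent process -- the
# property the ROCm test fix relies on.
spawn_ctx = mp.get_context("spawn")

def start_worker(target, *args):
    """Launch `target` in a freshly spawned process and return it."""
    proc = spawn_ctx.Process(target=target, args=args)
    proc.start()
    return proc
```

Using a context object rather than the global `multiprocessing.set_start_method` keeps the choice local to the tests instead of changing process startup for the whole program.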
🔧 Affected Symbols
- input cache client
- MLA
- Ultravox
- Mixtral
- Gemma3
- pixtral.py
- setup.py
- XPU