v0.8.0rc2
📦 vllm — Breaking Changes
⚠ 1 breaking · ✨ 4 features · 🐛 7 fixes · 🔧 8 symbols
Summary
This release focuses on V1 engine refinements, including making MLA the default and removing the input cache client. It also includes critical bug fixes for Ultravox, Mixtral, and ROCm testing environments.
⚠️ Breaking Changes
- The input cache client has been removed from the V1 engine. Users relying on this specific client for caching will need to transition to alternative caching mechanisms provided by the engine.
Migration Steps
- Remove any references to the input cache client from V1 engine configurations.
- If using TPU, ensure your environment is compatible with the updated ragged paged attention kernel which no longer requires padding.
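The first migration step can be sketched as a small config filter. This is illustrative only: the key name `input_cache_client` is an assumption, not vllm's actual configuration field; check your own V1 engine configuration for how the client was referenced.

```python
# Hypothetical key name -- adjust to match your own configuration.
DEPRECATED_KEYS = {"input_cache_client"}

def strip_deprecated(config: dict) -> dict:
    """Return a copy of `config` with keys removed in v0.8.0 dropped."""
    return {k: v for k, v in config.items() if k not in DEPRECATED_KEYS}
```

For example, `strip_deprecated({"input_cache_client": True, "model": "m"})` yields `{"model": "m"}`, which can then be passed to the engine as before.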
✨ New Features
- Added a patch merger utility.
- Added a --seed option to offline multi-modal examples.
- Enabled MLA (Multi-head Latent Attention) by default for the V1 engine.
- Applied ragged paged attention kernel fix and removed padding for TPU on V1.
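The `--seed` option above is typically wired into an example script with `argparse`; the sketch below shows the general shape, not the exact code added to the vllm examples.

```python
import argparse

# Minimal sketch: a --seed flag for reproducible runs. Defaulting to
# None keeps the previous (nondeterministic) behavior when unset.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--seed", type=int, default=None,
    help="Random seed for reproducible sampling (None = nondeterministic)",
)
args = parser.parse_args(["--seed", "42"])
```

Passing the parsed seed through to the sampling configuration then makes repeated runs of an offline multi-modal example reproducible.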
🐛 Bug Fixes
- Fixed Ultravox model support on V1 engine.
- Fixed ROCm tests by using the spawn method for starting new processes.
- Fixed Mixtral model to correctly use the head_dim configuration argument.
- Fixed structured output matcher construction by using vocab_size.
- Restricted Gemma3 multi-modal support to V0 only for the time being.
- Fixed misleading log messages during multi-modal profiling.
- Fixed linting issues (line length) in pixtral.py.
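The ROCm test fix above switches process startup to the spawn method. As a generic sketch (not vLLM's actual test code), the standard library lets you request a spawn context explicitly:

```python
import multiprocessing as mp

# "spawn" starts each child in a fresh interpreter instead of forking,
# so no GPU/runtime state is inherited from the parent process -- the
# property the ROCm test fix relies on.
spawn_ctx = mp.get_context("spawn")

def start_worker(target, *args):
    """Launch `target` in a freshly spawned process and return it."""
    proc = spawn_ctx.Process(target=target, args=args)
    proc.start()
    return proc
```

Using a context object rather than the global `multiprocessing.set_start_method` keeps the choice local to the tests instead of changing process startup for the whole program.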
🔧 Affected Symbols
- input cache client
- MLA
- Ultravox
- Mixtral
- Gemma3
- pixtral.py
- setup.py
- XPU