v0.9.1rc1
📦 vllm
⚠ 3 breaking · ✨ 7 features · 🐛 8 fixes · ⚡ 3 deprecations · 🔧 9 symbols
Summary
This release introduces quantization and multi-LoRA support for Neuron/TPU, migrates configurations to Pydantic dataclasses, and enforces stricter keyword-only arguments for LLM initialization. It also includes significant bug fixes for MLA attention accuracy and V1 backend stability.
⚠️ Breaking Changes
- Disallowed positional arguments other than 'model' when initializing the LLM class. Users must now use keyword arguments for all other parameters.
- Required overriding 'get_dummy_text' and 'get_dummy_mm_data' in model implementations, which may break custom model integrations that relied on base class defaults.
- Removed metrics that were deprecated in version 0.8, which may break monitoring dashboards or integrations relying on those specific keys.
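The second breaking change means a custom multimodal model's processor must now supply both dummy-input methods itself. A minimal stand-in sketch of the contract (the method names come from this release, but the base class and signatures below are illustrative, not vLLM's actual ones):

```python
from abc import ABC, abstractmethod

# Stand-in for the dummy-inputs builder base class: the base no longer
# provides defaults, so subclasses must implement both methods.
class BaseDummyInputsBuilder(ABC):
    @abstractmethod
    def get_dummy_text(self, mm_counts):
        """Return placeholder prompt text for profiling."""

    @abstractmethod
    def get_dummy_mm_data(self, seq_len, mm_counts):
        """Return placeholder multimodal inputs for profiling."""

class MyModelDummyInputsBuilder(BaseDummyInputsBuilder):
    def get_dummy_text(self, mm_counts):
        # One placeholder token per requested image.
        return "<image>" * mm_counts.get("image", 0)

    def get_dummy_mm_data(self, seq_len, mm_counts):
        # Dummy image payloads; a real implementation would build PIL images
        # or tensors sized for the model.
        return {"image": [object()] * mm_counts.get("image", 0)}
```

Integrations that previously inherited base-class defaults will fail to instantiate until both overrides exist.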
Migration Steps
- Update LLM class initialization to use keyword arguments for all parameters except the model path.
- If using custom models, implement 'get_dummy_text' and 'get_dummy_mm_data' methods.
- Update monitoring systems to remove references to metrics deprecated in v0.8.
- Upgrade Pydantic if running on Python 3.10 to avoid compatibility errors.
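The keyword-only rule can be illustrated without vLLM installed; the sketch below shows the signature shape the LLM class now enforces (parameter names beyond `model` are examples):

```python
class LLM:
    # Sketch of the new contract: the bare `*` makes every parameter
    # after `model` keyword-only.
    def __init__(self, model, *, tensor_parallel_size=1, dtype="auto"):
        self.model = model
        self.tensor_parallel_size = tensor_parallel_size
        self.dtype = dtype

llm = LLM("facebook/opt-125m", tensor_parallel_size=2)  # OK: keywords used

try:
    LLM("facebook/opt-125m", 2)  # previously legal, now a TypeError
except TypeError:
    print("positional arguments beyond `model` are rejected")
```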
✨ New Features
- Added quantization support for Neuron devices.
- Added multi-LoRA support for Neuron and V1 TPU backend.
- Added LoRA support for Beam Search and InternVL models.
- Added a 'run batch' command to the CLI.
- Enabled CUDA graphs for DP + All2All kernels and added the ability to use CUDA graphs with use_inductor=False.
- Added support for datasets in 'vllm bench serve'.
- Converted configurations to Pydantic dataclasses, with mypy checks enabled.
🐛 Bug Fixes
- Fixed FA2 MLA accuracy issues.
- Correctly propagate error messages from chat_templating step to the client.
- Fixed nomic max_model_len handling and ROCm power-of-two exceptions in triton_unified_attention.
- Fixed an EAGLE bug and lm_head handling for multimodal models in V1.
- Ensured tensors are contiguous during serialization to prevent corruption.
- Fixed Pydantic errors on Python 3.10.
- Fixed TPU CI exit codes and MoE custom kernel imports.
- Disabled prefix caching by default for benchmarks to ensure accurate results.
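The contiguity fix guards against serializers that dump a tensor's underlying buffer directly: a strided (non-contiguous) view shares storage with bytes that are not part of it, so raw-buffer writes would emit stale data. A stdlib-only illustration of the hazard (vLLM's actual fix operates on torch tensors, not memoryviews):

```python
buf = bytearray(range(16))
view = memoryview(buf)[::2]   # strided view: every other byte, not contiguous
print(view.c_contiguous)      # False: dumping buf directly would interleave
                              # bytes that belong only to the parent buffer

safe = view.tobytes()         # materializes a contiguous copy before writing
print(safe == bytes(range(0, 16, 2)))  # True: only the view's own elements
```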
🔧 Affected Symbols
- LLM
- get_dummy_text
- get_dummy_mm_data
- async_timeout
- PiecewiseCompileInterpreter
- AutoWeightsLoader
- vllm.bench.serve
- triton_unified_attention
- pydantic.dataclasses
⚡ Deprecations
- Removed unused sync methods in 'async_timeout'.
- Removed fallbacks for the Embeddings API.
- Deprecated positional arguments for LLM initialization (except for the model name/path).