Changelog

v0.9.1rc1

Breaking Changes
📦 vllm
⚠️ 3 breaking · ✨ 7 features · 🐛 8 fixes · ⚡ 3 deprecations · 🔧 9 symbols

Summary

This release introduces quantization and multi-LoRA support for Neuron/TPU, migrates configurations to Pydantic dataclasses, and enforces stricter keyword-only arguments for LLM initialization. It also includes significant bug fixes for MLA attention accuracy and V1 backend stability.

⚠️ Breaking Changes

  • Disallowed positional arguments other than 'model' when initializing the LLM class. Users must now use keyword arguments for all other parameters.
  • Required overriding 'get_dummy_text' and 'get_dummy_mm_data' in model implementations, which may break custom model integrations that relied on base class defaults.
  • Removed metrics that were deprecated in version 0.8, which may break monitoring dashboards or integrations relying on those specific keys.
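Since 'get_dummy_text' and 'get_dummy_mm_data' no longer have usable base-class defaults, custom model integrations must override both. The method names come from this release, but the base class and signatures below are hypothetical stand-ins, not vLLM's actual interface:

```python
# Hypothetical sketch; the base class and signatures are stand-ins, not vLLM's API.

class DummyInputsBuilder:
    """Stand-in for the base class that now requires these overrides."""

    def get_dummy_text(self, mm_counts: dict) -> str:
        raise NotImplementedError  # no base-class default anymore

    def get_dummy_mm_data(self, seq_len: int, mm_counts: dict) -> dict:
        raise NotImplementedError


class MyModelDummyInputsBuilder(DummyInputsBuilder):
    # Custom integrations must now provide both methods explicitly.
    def get_dummy_text(self, mm_counts: dict) -> str:
        return "<image>" * mm_counts.get("image", 0)

    def get_dummy_mm_data(self, seq_len: int, mm_counts: dict) -> dict:
        return {"image": [object()] * mm_counts.get("image", 0)}
```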

Migration Steps

  1. Update LLM class initialization to use keyword arguments for all parameters except the model path.
  2. If using custom models, implement 'get_dummy_text' and 'get_dummy_mm_data' methods.
  3. Update monitoring systems to remove references to metrics deprecated in v0.8.
  4. Upgrade Pydantic if running on Python 3.10 to avoid compatibility errors.
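Step 1 amounts to Python's keyword-only calling convention. A minimal stand-in sketch of the pattern now enforced (the real LLM class accepts many more parameters; the ones shown here are illustrative):

```python
# Hypothetical stand-in illustrating the keyword-only contract; not vLLM source.
class LLM:
    def __init__(self, model, *, tokenizer=None, dtype="auto",
                 gpu_memory_utilization=0.9):
        # The bare `*` rejects any positional argument after `model`.
        self.model = model
        self.tokenizer = tokenizer
        self.dtype = dtype
        self.gpu_memory_utilization = gpu_memory_utilization


# After the change, only the model may be positional:
llm = LLM("facebook/opt-125m", dtype="float16")   # OK

try:
    LLM("facebook/opt-125m", "my-tokenizer")      # positional tokenizer: rejected
except TypeError:
    rejected = True
```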

✨ New Features

  • Added quantization support for Neuron devices.
  • Added multi-LoRA support for Neuron and V1 TPU backend.
  • Added LoRA support for Beam Search and InternVL models.
  • Introduced 'run batch' command to the CLI.
  • Enabled CUDA graphs for DP + All2All kernels and added ability to use CUDAGraphs with use_inductor=False.
  • Support for datasets in 'vllm bench serve'.
  • Converted configurations to Pydantic dataclasses with mypy checks enabled.
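The Pydantic migration means configuration fields are validated at construction time rather than failing later. A minimal stdlib sketch of the effect, assuming illustrative field names (vLLM uses pydantic.dataclasses, which derives such checks from the type annotations automatically):

```python
from dataclasses import dataclass

# Illustrative config; the field names are made up, not vLLM's actual schema.
@dataclass
class SchedulerConfig:
    max_num_seqs: int = 256
    max_model_len: int = 2048

    def __post_init__(self):
        # pydantic.dataclasses generates validation like this from annotations.
        if not isinstance(self.max_num_seqs, int) or self.max_num_seqs <= 0:
            raise ValueError("max_num_seqs must be a positive int")
        if not isinstance(self.max_model_len, int) or self.max_model_len <= 0:
            raise ValueError("max_model_len must be a positive int")
```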

🐛 Bug Fixes

  • Fixed FA2 MLA accuracy issues.
  • Correctly propagate error messages from chat_templating step to the client.
  • Fixed nomic max_model_len and ROCm power-of-2 exceptions in triton_unified_attention.
  • Fixed eagle bug and lm_head for multimodal models in V1.
  • Ensured tensors are contiguous during serialization to prevent corruption.
  • Fixed Pydantic errors on Python 3.10.
  • Fixed TPU CI exit codes and MoE custom kernel imports.
  • Disabled prefix caching by default for benchmarks to ensure accurate results.
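The serialization fix has a simple stdlib analogue: a strided view does not own a contiguous buffer, so serializing the underlying storage directly would emit the wrong bytes. In vLLM the fix is on torch tensors (e.g. making them contiguous before writing); the memoryview below is only an illustration of the same hazard:

```python
buf = bytearray(range(8))      # underlying storage: bytes 0..7
view = memoryview(buf)[::2]    # strided view selecting elements 0, 2, 4, 6

assert not view.c_contiguous   # the view is NOT contiguous in memory

# Dumping the raw storage would wrongly include the skipped bytes;
# tobytes() first copies the view into a contiguous buffer, analogous
# to calling .contiguous() on a tensor before serialization.
data = view.tobytes()
assert data == bytes([0, 2, 4, 6])
```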

🔧 Affected Symbols

LLM, get_dummy_text, get_dummy_mm_data, async_timeout, PiecewiseCompileInterpreter, AutoWeightsLoader, vllm.bench.serve, triton_unified_attention, pydantic.dataclasses

⚡ Deprecations

  • Removed unused sync methods in 'async_timeout'.
  • Removed fallbacks for the Embeddings API.
  • Deprecated positional arguments for LLM initialization (except for the model name/path).