v0.9.1
Breaking Changes · 📦 vllm
⚠ 6 breaking · ✨ 11 features · 🐛 7 fixes · ⚡ 5 deprecations · 🔧 9 symbols
Summary
This release introduces significant performance optimizations for large-scale serving, including DP/EP CUDA graph support and Blackwell hardware integration. It also enforces stricter API usage by removing several long-standing deprecations and positional argument support in the LLM class.
⚠️ Breaking Changes
- Positional arguments other than 'model' are no longer allowed when initializing the LLM class; use keyword arguments instead.
- The 'inputs' argument fallback in Engine classes has been removed.
- Fallbacks for the Embeddings API have been removed.
- Default mean pooling for Qwen2EmbeddingModel has been removed; pooling must now be explicitly defined.
- Custom model implementations must now explicitly override 'get_dummy_text' and 'get_dummy_mm_data'.
- Metrics deprecated in version 0.8 have been completely removed.
Migration Steps
- Update LLM class initializations to use keyword arguments for all parameters except 'model' (see the keyword-argument sketch below).
- Update custom model classes to implement 'get_dummy_text' and 'get_dummy_mm_data' (see the dummy-inputs sketch below).
- Review monitoring systems for references to the metrics that were deprecated in 0.8 and have now been removed.
- Explicitly configure pooling for Qwen2EmbeddingModel if you relied on the previous default (see the pooling sketch below).
- Update Engine class calls to remove the 'inputs' argument if still in use.
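The keyword-argument sketch below shows the LLM initialization change; the model name and the specific keyword arguments are placeholders chosen for illustration.

```python
from vllm import LLM

# Before (no longer accepted): extra positional arguments after the model name.
# llm = LLM("facebook/opt-125m", None, "auto", True)

# After: only 'model' may be positional; everything else must be a keyword.
llm = LLM(
    "facebook/opt-125m",       # placeholder model name
    tokenizer_mode="auto",
    trust_remote_code=True,
    tensor_parallel_size=1,
)
```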
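The dummy-inputs sketch below illustrates the two required overrides for a custom multimodal model. It assumes the BaseDummyInputsBuilder interface from vllm.multimodal.profiling; the placeholder token, the modality key ('image'), and the image resolution are assumptions, so check them against your model's processor.

```python
from typing import Mapping

from PIL import Image

from vllm.multimodal.profiling import BaseDummyInputsBuilder  # assumed import path


class MyModelDummyInputsBuilder(BaseDummyInputsBuilder):
    def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
        # Dummy prompt text: one placeholder token per requested image.
        return "<image>" * mm_counts.get("image", 0)

    def get_dummy_mm_data(self, seq_len: int, mm_counts: Mapping[str, int]):
        # Dummy multimodal data used for memory profiling; sized to the
        # largest input the model is expected to accept (assumed 336x336).
        num_images = mm_counts.get("image", 0)
        return {"image": [Image.new("RGB", (336, 336))] * num_images}
```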
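The pooling sketch below restores explicit MEAN pooling for a Qwen2-based embedding model. It is a minimal sketch assuming the PoolerConfig / override_pooler_config spelling in this release; the model name is a placeholder.

```python
from vllm import LLM
from vllm.config import PoolerConfig  # assumed import path

# Explicitly request MEAN pooling instead of relying on the removed default.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct-embed-base",  # placeholder model name
    task="embed",
    override_pooler_config=PoolerConfig(pooling_type="MEAN"),
)
outputs = llm.embed(["vLLM is a fast inference engine."])
```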
✨ New Features
- Data Parallel (DP) attention and Expert Parallelism (EP), now with CUDA graph support.
- Integration of DeepEP dispatch-combine and DeepGEMM kernels.
- Heterogeneous Tensor Parallelism (TP) and FlashInfer backend for NixlConnector.
- API-server scaleout with many-to-many server-engine communications.
- Initial full support for Hybrid Memory Allocator and cross-layer KV sharing.
- FlexAttention support in vLLM V1.
- Support for new models: Magistral, InternVL (LoRA), Minicpm Eagle, NemotronH, and Llama4 vision encoder.
- Hardware support for Blackwell (Cutlass MLA, FlashInfer by default) and NVFP4 (FP4).
- V1 support for CPU backend and IBM POWER11 CPU extension detection.
- New 'run_batch' CLI and endpoint with rerank support.
- Added 'allowed_token_ids' to ChatCompletionRequest (see the request sketch below).
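The request sketch below shows one way to pass the new field through the OpenAI-compatible chat endpoint. The model name and token IDs are placeholders, and routing the field through extra_body is an assumption based on how vLLM-specific extensions are typically sent with the openai client.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model name
    messages=[{"role": "user", "content": "Reply with YES or NO."}],
    # vLLM-specific extension: restrict sampling to these token IDs
    # (placeholder values; look up real IDs with your model's tokenizer).
    extra_body={"allowed_token_ids": [9642, 9173]},
    max_tokens=1,
)
print(response.choices[0].message.content)
```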
🐛 Bug Fixes
- Disabled prefix caching by default in benchmarks to ensure accurate results.
- Fixed nomic max_model_len calculation.
- Corrected error message propagation from chat_templating to clients.
- Fixed FA2 MLA accuracy issues.
- Resolved Olmoe model layer issues for TP > 1.
- Fixed TPU CI exit codes and MoE custom kernel imports.
- Fixed Triton unified attention power-of-2 exception on ROCm for Llama4 models.
🔧 Affected Symbols
- LLM
- AsyncLLMEngine.generate
- Qwen2EmbeddingModel
- ChatCompletionRequest
- KVEventBatch
- get_dummy_text
- get_dummy_mm_data
- async_timeout
- Sampler
⚡ Deprecations
- Positional arguments in LLM initialization (except for 'model').
- Unused sync methods in 'async_timeout'.
- Legacy 'inputs' argument in Engine classes.
- Legacy fallback behavior in Embeddings API.
- Default pooling behavior for Qwen2EmbeddingModel.