Change8

v0.9.1

Breaking Changes
📦 vllm
⚠️ 6 breaking · ✨ 11 features · 🐛 7 fixes · ⚡ 5 deprecations · 🔧 9 symbols

Summary

This release introduces significant performance optimizations for large-scale serving, including DP/EP CUDA graph support and Blackwell hardware integration. It also enforces stricter API usage by removing several long-deprecated code paths and by dropping support for positional arguments (other than 'model') in the LLM class.

⚠️ Breaking Changes

  • Positional arguments other than 'model' are no longer allowed when initializing the LLM class; use keyword arguments instead (see the sketch after this list).
  • The 'inputs' argument fallback in Engine classes has been removed.
  • Fallbacks for the Embeddings API have been removed.
  • Default mean pooling for Qwen2EmbeddingModel has been removed; pooling must now be explicitly defined.
  • Custom model implementations must now explicitly override 'get_dummy_text' and 'get_dummy_mm_data'.
  • Metrics deprecated in version 0.8 have been completely removed.
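
A minimal sketch of the keyword-only LLM initialization referenced above; the model name and argument values are placeholders:

```python
from vllm import LLM

# Only 'model' may still be passed positionally; every other argument
# must now be given as a keyword, otherwise initialization raises an error.
llm = LLM(
    "facebook/opt-125m",          # placeholder model name
    tensor_parallel_size=1,
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```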

Migration Steps

  1. Update LLM class initializations to use keyword arguments for all parameters except 'model'.
  2. Update custom model classes to implement 'get_dummy_text' and 'get_dummy_mm_data' (see the first sketch after these steps).
  3. Review monitoring systems for removed metrics that were deprecated in 0.8.
  4. Explicitly configure pooling for Qwen2EmbeddingModel if relying on previous defaults (see the second sketch after these steps).
  5. Update Engine class calls to remove the 'inputs' argument if still in use.
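
For step 2, a hedged sketch of a dummy-inputs builder; the BaseDummyInputsBuilder base class, the '<image>' placeholder token, and the image size are assumptions and may not match your model's processor:

```python
from typing import Mapping

from PIL import Image

from vllm.multimodal.profiling import BaseDummyInputsBuilder


class MyDummyInputsBuilder(BaseDummyInputsBuilder):
    # Both methods must now be overridden explicitly; there is no inherited default.

    def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
        # One placeholder token per requested image (token string is illustrative).
        return "<image>" * mm_counts.get("image", 0)

    def get_dummy_mm_data(self, seq_len: int, mm_counts: Mapping[str, int]):
        # Blank images sized for worst-case memory profiling (size is illustrative).
        num_images = mm_counts.get("image", 0)
        return {"image": [Image.new("RGB", (336, 336)) for _ in range(num_images)]}
```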
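
For step 4, a hedged sketch of making the pooling behavior explicit; the override_pooler_config/PoolerConfig spelling is an assumption about the current engine arguments, and the checkpoint name is a placeholder:

```python
from vllm import LLM
from vllm.config import PoolerConfig

# Explicitly request MEAN pooling instead of relying on the removed default.
llm = LLM(
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",  # placeholder Qwen2-based embedding model
    task="embed",
    override_pooler_config=PoolerConfig(pooling_type="MEAN"),
)

outputs = llm.embed(["vLLM is a fast inference engine."])
print(len(outputs[0].outputs.embedding))
```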

✨ New Features

  • Support for Data Parallel (DP) Attention and Expert Parallelism (EP) with CUDA graph support.
  • Integration of DeepEP dispatch-combine and DeepGEMM kernels.
  • Heterogeneous Tensor Parallelism (TP) and FlashInfer backend for NixlConnector.
  • API-server scaleout with many-to-many server-engine communications.
  • Initial full support for Hybrid Memory Allocator and cross-layer KV sharing.
  • FlexAttention support in vLLM V1.
  • Support for new models: Magistral, InternVL (LoRA), Minicpm Eagle, NemotronH, and Llama4 vision encoder.
  • Hardware support for Blackwell (Cutlass MLA, FlashInfer by default) and NVFP4 (FP4).
  • V1 support for CPU backend and IBM POWER11 CPU extension detection.
  • New 'run_batch' CLI and endpoint with rerank support.
  • Added 'allowed_token_ids' to ChatCompletionRequest (see the sketch after this list).
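
A hedged sketch of the new 'allowed_token_ids' field on a chat completion request, sent through the OpenAI-compatible server; the URL, model name, and token IDs are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="facebook/opt-125m",                       # placeholder model
    messages=[{"role": "user", "content": "Answer yes or no."}],
    max_tokens=1,
    # vLLM-specific request field, forwarded via the OpenAI client's extra_body.
    extra_body={"allowed_token_ids": [9904, 3084]},  # placeholder token IDs
)
print(resp.choices[0].message.content)
```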

🐛 Bug Fixes

  • Disabled prefix caching by default for benchmarks to ensure accuracy.
  • Fixed nomic max_model_len calculation.
  • Corrected error message propagation from chat_templating to clients.
  • Fixed FA2 MLA accuracy issues.
  • Resolved Olmoe model layer issues for TP > 1.
  • Fixed TPU CI exit codes and MoE custom kernel imports.
  • Fixed Triton unified attention power-of-2 exception on ROCm for Llama4 models.

🔧 Affected Symbols

LLM, AsyncLLMEngine.generate, Qwen2EmbeddingModel, ChatCompletionRequest, KVEventBatch, get_dummy_text, get_dummy_mm_data, async_timeout, Sampler

⚡ Deprecations

  • Positional arguments in LLM initialization (except for 'model').
  • Unused sync methods in 'async_timeout'.
  • Legacy 'inputs' argument in Engine classes.
  • Legacy fallback behavior in Embeddings API.
  • Default pooling behavior for Qwen2EmbeddingModel.