Changelog

v0.11.1

📦 vllm
⚠️ 4 breaking · ✨ 10 features · 🐛 9 fixes · ⚡ 3 deprecations · 🔧 9 symbols

Summary

This release moves the default vLLM build to PyTorch 2.9.0 and CUDA 12.9.1, introduces Anthropic API compatibility via a /v1/messages endpoint, and significantly improves the stability of async scheduling and torch.compile integration.

⚠️ Breaking Changes

  • Removed the vllm.worker module; update imports to the new internal structure (see the import sketch after this list).
  • Removed MotifForCausalLM model support.
  • Consolidated speculative decoding method names for MTP (multi-token prediction); custom implementations that rely on the old names may need updates.
  • Removed V0 conditions for multimodal embeddings merging, requiring migration to V1 logic.
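
Most of the breaking changes above reduce to import updates. A minimal sketch, assuming the V1 engine layout; the exact new module path (vllm.v1.worker.gpu_worker in particular) is an assumption to verify against your installed tree:

```python
# Before (<= v0.11.0): the V0 worker module, removed in this release.
# from vllm.worker.worker import Worker

# After (v0.11.1): the V1 GPU worker (assumed location).
from vllm.v1.worker.gpu_worker import Worker

# VllmConfig now lives in vllm/config/vllm.py; the top-level re-export
# is expected to keep working for most callers.
from vllm.config import VllmConfig
```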

Migration Steps

  1. Update PyTorch to 2.9.0 and CUDA to 12.9.1 to match the new default build.
  2. Replace imports from vllm.worker with the updated core worker locations.
  3. Update speculative decoding implementations to use consolidated MTP method names.
  4. If using Anthropic clients, point them at the new /v1/messages endpoint exposed by vllm serve (see the example after this list).
  5. Update custom model loaders to import VllmConfig from vllm/config/vllm.py instead of vllm/config/__init__.py.
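
For step 4, a minimal sketch using the official anthropic Python SDK against a local vllm serve instance; the port and served model name are placeholders:

```python
# Point the Anthropic Python SDK at a local vLLM server, which now
# exposes an Anthropic-compatible /v1/messages endpoint.
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8000",  # the SDK appends /v1/messages
    api_key="EMPTY",  # ignored by vLLM unless --api-key is set
)

response = client.messages.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder: whatever you serve
    max_tokens=256,
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.content[0].text)
```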

✨ New Features

  • Updated default build to PyTorch 2.9.0 and CUDA 12.9.1.
  • Added support for Anthropic-compatible /v1/messages API endpoint.
  • Generalized batch-invariant torch.compile support for attention and MoE backends.
  • Added support for DeepSeek-V3.2.
  • Added support for DeepGEMM and FlashInfer on Hopper and Blackwell GPUs.
  • Added Eagle/Eagle3 multimodal support and enablement on Qwen2.5-VL.
  • Added support for Cambricon MLU.
  • Added a VLLM_DEBUG_DUMP_PATH environment variable for torch.compile debugging (see the sketch after this list).
  • Added option to restrict media domains for security.
  • Added LoRA support for OPT models.
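
For the new debug variable, a minimal sketch; what vLLM writes into the dump directory is not documented here, so treat the contents as something to inspect after a run:

```python
# Enable torch.compile debug dumps by setting VLLM_DEBUG_DUMP_PATH
# before vLLM is imported (setting it first is the safe order).
import os

os.environ["VLLM_DEBUG_DUMP_PATH"] = "/tmp/vllm_compile_debug"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # small placeholder model
print(llm.generate("Hello")[0].outputs[0].text)
```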

🐛 Bug Fixes

  • Fixed correctness and stability issues in async scheduling with chunked prefill and structured outputs.
  • Fixed the is_reasoning_end condition in the GLM4 MoE reasoning parser.
  • Fixed FlashInfer AOT in release Docker images.
  • Fixed memory profiling for scattered multimodal embeddings.
  • Fixed weight loading for Block FP8 Cutlass SM90.
  • Fixed KV scale calculation issues with FP8 quantization in torch.compile.
  • Fixed accuracy issues for TRTLLM FP8 MoE.
  • Fixed Qwen3-VL regression and multi-GPU (PP > 1) loading issues.
  • Fixed MiDashengLM audio encoder mask and quantization issues.

🔧 Affected Symbols

vllm.worker, VllmConfig, MotifForCausalLM, get_input_embeddings_v0, Phi4FlashForCausalLM, NixlConnector, torch.compile, MiDashengLM, GLM4 MoE Reasoning Parser

⚡ Deprecations

  • V0 worker module (vllm.worker) has been removed (see the compatibility sketch after this list).
  • get_input_embeddings_v0 has been removed.
  • V0 multimodal embedding merge logic has been removed in favor of the V1 path.
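
For code that must span releases on both sides of the vllm.worker removal, a guarded import is one option; the module paths here are the same assumptions as in the earlier sketch:

```python
# Version-compatibility shim across the vllm.worker removal.
try:
    # v0.11.1+: V1 worker layout (assumed path).
    from vllm.v1.worker.gpu_worker import Worker
except ImportError:
    # <= v0.11.0: the removed V0 module.
    from vllm.worker.worker import Worker
```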