v0.11.1
📦 vllm
⚠ 4 breaking · ✨ 10 features · 🐛 9 fixes · ⚡ 3 deprecations · 🔧 9 symbols
Summary
This release updates vLLM to PyTorch 2.9.0 and CUDA 12.9.1, introduces Anthropic API compatibility, and significantly improves the stability of async scheduling and torch.compile integration.
⚠️ Breaking Changes
- Removed vllm.worker module; update imports to use the new internal structure.
- Removed MotifForCausalLM model support.
- Consolidated speculative decoding method names for MTP, which may affect custom implementations that rely on the old naming.
- Removed V0 conditions for multimodal embeddings merging, requiring migration to V1 logic.
Migration Steps
- Update PyTorch to 2.9.0 and CUDA to 12.9.1 to match the new default build.
- Replace imports from vllm.worker with the updated core worker locations.
- Update speculative decoding implementations to use consolidated MTP method names.
- If using Anthropic clients, point them to the new /v1/messages endpoint on vllm serve (see the client sketch after this list).
- Update custom model loaders to import VllmConfig from config/vllm.py instead of config/__init__.py (see the import sketch after this list).
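The two client-facing steps above lend themselves to short examples. First, a minimal sketch of pointing the official Anthropic Python SDK at a local vllm serve instance; the host/port, API key, and model name are placeholders, not values prescribed by this release:

```python
# Minimal sketch: use the Anthropic SDK against vLLM's /v1/messages endpoint.
# Assumes `vllm serve <model>` is listening on localhost:8000.
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:8000",  # the SDK appends /v1/messages to this base URL
    api_key="EMPTY",                   # placeholder; use whatever auth your server expects
)

response = client.messages.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder: the model you are serving
    max_tokens=128,
    messages=[{"role": "user", "content": "Summarize this release in one sentence."}],
)
print(response.content[0].text)
```

Second, a sketch of the VllmConfig import change; the dotted module path is inferred from the file locations noted above:

```python
# New import path (inferred from config/vllm.py); previously:
#   from vllm.config import VllmConfig
from vllm.config.vllm import VllmConfig


def load_custom_model(vllm_config: VllmConfig):
    """Hypothetical loader signature, shown only to illustrate the import."""
    ...
```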
✨ New Features
- Updated default build to PyTorch 2.9.0 and CUDA 12.9.1.
- Added support for Anthropic-compatible /v1/messages API endpoint.
- Generalized batch-invariant torch.compile support for attention and MoE backends.
- Added support for DeepSeek-V3.2.
- Added support for DeepGEMM and FlashInfer on Hopper and Blackwell GPUs.
- Added Eagle/Eagle3 multimodal support, with enablement on Qwen2.5-VL.
- Added support for Cambricon MLU.
- Added VLLM_DEBUG_DUMP_PATH environment variable for torch.compile debugging (see the sketch after this list).
- Added option to restrict media domains for security.
- Added LoRA support for OPT models.
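A minimal sketch of the new torch.compile debugging hook: the environment variable name comes from the note above, while the dump directory and model name are arbitrary placeholders:

```python
# Minimal sketch: set VLLM_DEBUG_DUMP_PATH before vLLM is imported so that
# torch.compile debug artifacts are written under the chosen directory.
import os
os.environ["VLLM_DEBUG_DUMP_PATH"] = "/tmp/vllm_compile_debug"  # placeholder path

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
print(llm.generate(["Hello"])[0].outputs[0].text)
```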
🐛 Bug Fixes
- Fixed correctness and stability issues in async scheduling with chunked prefill and structured outputs.
- Fixed the is_reasoning_end condition in the GLM4 MoE reasoning parser.
- Fixed FlashInfer AOT compilation in release Docker images.
- Fixed memory profiling for scattered multimodal embeddings.
- Fixed weight loading for Block FP8 Cutlass SM90.
- Fixed KV scale calculation issues with FP8 quantization in torch.compile.
- Fixed accuracy issues for TRTLLM FP8 MoE.
- Fixed Qwen3-VL regression and multi-GPU (PP > 1) loading issues.
- Fixed MiDashengLM audio encoder mask and quantization issues.
🔧 Affected Symbols
vllm.worker · VllmConfig · MotifForCausalLM · get_input_embeddings_v0 · Phi4FlashForCausalLM · NixlConnector · torch.compile · MiDashengLM · GLM4 MoE Reasoning Parser
⚡ Deprecations
- V0 worker module (vllm.worker) has been removed.
- get_input_embeddings_v0 has been removed.
- V0 multimodal embedding merge logic has been removed.