v0.13.0
Summary
vLLM v0.13.0 introduces support for NVIDIA Blackwell Ultra and DeepSeek-V3.2, alongside a major performance overhaul for Whisper models. This release transitions attention configuration from environment variables to CLI arguments and includes significant core engine optimizations like Model Runner V2.
⚠️ Breaking Changes
- PassConfig flags have been renamed per RFC #27995.
- The environment variable VLLM_ATTENTION_BACKEND has been removed; use the --attention-backend CLI argument instead.
- The -O.xx flag has been removed.
- Deprecated plugin and compilation fields have been removed.
- Deprecated task, seed, and Multi-Modal (MM) settings have been removed.
- Removed embed_input_ids and embed_multimodal fallback mechanisms.
- The tokenizer setter has been removed.
Migration Steps
- Update deployment scripts to replace VLLM_ATTENTION_BACKEND environment variable with --attention-backend CLI flag.
- Review and update PassConfig flag names in custom configurations to match RFC #27995.
- Replace usage of --convert reward with --convert embed.
- Remove any usage of the deprecated -O.xx optimization flags.
- Ensure tokenizer initialization does not rely on the removed tokenizer setter.
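As an illustration of the migration steps above, a launch command might change as follows. This is a sketch: the model names and the FLASH_ATTN backend value are placeholder assumptions, not taken from this release; check vllm serve --help on v0.13.0 for the accepted values.

```shell
# Before (pre-v0.13.0): attention backend selected via environment variable
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve my-org/my-model

# After (v0.13.0): attention backend selected via CLI argument
vllm serve my-org/my-model --attention-backend FLASH_ATTN

# Deprecated converter flag: replace --convert reward with --convert embed
vllm serve my-org/my-reward-model --convert embed
```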
✨ New Features
- Support for new models: BAGEL, AudioFlamingo3, JAIS 2, and latent MoE architectures.
- Added tool parsers for DeepSeek-V3.2, Gigachat 3, and Holo2 reasoning.
- NVIDIA Blackwell Ultra (SM103/GB300) support with CUDA 13.
- Whisper model performance overhaul (~3x speedup) with CPU backend support.
- Introduction of Model Runner V2 with min-p sampling and NaN detection.
- New MCP (Model Context Protocol) infrastructure for tool use and browser/container integration.
- Conditional compilation via compile_ranges for selective kernel builds.
- Support for xxHash high-performance hashing in prefix caching.
- Multi-vector retrieval API and binary format support for embeddings.
- Mooncake Transfer Engine for KV connectors and /reset_prefix_cache API.
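Min-p sampling, mentioned above for Model Runner V2, filters the token distribution relative to the probability of the most likely token: tokens below min_p times the top probability are dropped and the rest are renormalized. A minimal standalone sketch of the idea (the function name and structure are illustrative, not vLLM's internals):

```python
def min_p_filter(probs, min_p):
    """Zero out tokens whose probability is below min_p * max(probs),
    then renormalize the survivors to sum to 1."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# With min_p=0.2 and a top probability of 0.5, the cutoff is 0.1,
# so the 0.05 tail token is removed before renormalization.
probs = [0.5, 0.3, 0.15, 0.05]
filtered = min_p_filter(probs, min_p=0.2)
```

Unlike top-k, the number of surviving tokens adapts to how peaked the distribution is: a confident distribution prunes aggressively, a flat one keeps more candidates.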
🐛 Bug Fixes
- Fixed DeepSeek V3.2 top-k logic and drop_thinking behavior.
- Resolved Medusa GPU-CPU synchronization issues to avoid blocking.
- Fixed Anthropic API streaming response issues.
- Restored MoE + GGUF support for Qwen2 and Qwen3 MoE models.
- Security fix for CVE-2025-62164.
- Fixed Triton ScaledMM fallback for AMD ROCm.
🔧 Affected Symbols
- AttentionConfig
- VLLM_ATTENTION_BACKEND
- PassConfig
- ModelConfig
- embed_input_ids
- embed_multimodal
- selective_state_update
- compile_ranges
- encoding_format
⚡ Deprecations
- merge_by_field_config is now deprecated.
- The --convert reward flag is deprecated in favor of --convert embed.