
v0.13.0

Breaking Changes
📦 vllm — 7 breaking changes · 10 features · 6 fixes · 2 deprecations · 9 affected symbols

Summary

vLLM v0.13.0 introduces support for NVIDIA Blackwell Ultra and DeepSeek-V3.2, alongside a major performance overhaul for Whisper models. This release transitions attention configuration from environment variables to CLI arguments and includes significant core engine optimizations like Model Runner V2.

⚠️ Breaking Changes

  • PassConfig flags have been renamed per RFC #27995.
  • The environment variable VLLM_ATTENTION_BACKEND has been removed; use the --attention-backend CLI argument instead.
  • The -O.xx flag has been removed.
  • Deprecated plugin and compilation fields have been removed.
  • Deprecated task, seed, and Multi-Modal (MM) settings have been removed.
  • Removed embed_input_ids and embed_multimodal fallback mechanisms.
  • The tokenizer setter has been removed.

Migration Steps

  1. Update deployment scripts to replace VLLM_ATTENTION_BACKEND environment variable with --attention-backend CLI flag.
  2. Review and update PassConfig flag names in custom configurations to match RFC #27995.
  3. Replace usage of --convert reward with --convert embed.
  4. Remove any usage of the deprecated -O.xx optimization flags.
  5. Ensure tokenizer initialization does not rely on the removed tokenizer setter.
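Step 1 above is the most common change for deployments. A minimal sketch of the migration, assuming a typical `vllm serve` invocation (the model name and backend value here are illustrative examples, not taken from the release notes):

```shell
# Before (removed in v0.13.0): backend selected via environment variable
#   VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve meta-llama/Llama-3.1-8B-Instruct

# After: pass the backend as a CLI argument instead
vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend FLASH_ATTN
```

Configuration-management tooling (systemd units, Helm charts, Docker entrypoints) that injects `VLLM_ATTENTION_BACKEND` should be updated in the same way, since the environment variable is now silently ignored or rejected.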

✨ New Features

  • Support for new models: BAGEL, AudioFlamingo3, JAIS 2, and latent MoE architectures.
  • Added tool parsers for DeepSeek-V3.2, Gigachat 3, and Holo2 reasoning.
  • NVIDIA Blackwell Ultra (SM103/GB300) support with CUDA 13.
  • Whisper model performance overhaul (~3x speedup) with CPU backend support.
  • Introduction of Model Runner V2 with min-p sampling and NaN detection.
  • New MCP (Model Context Protocol) infrastructure for tool use and browser/container integration.
  • Conditional compilation via compile_ranges for selective kernel builds.
  • Support for xxHash high-performance hashing in prefix caching.
  • Multi-vector retrieval API and binary format support for embeddings.
  • Mooncake Transfer Engine for KV connectors and /reset_prefix_cache API.
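For the `/reset_prefix_cache` API mentioned above, a minimal sketch of invoking it against a running server, assuming the default OpenAI-compatible server on port 8000 (the host and port are assumptions, not from the release notes):

```shell
# Hedged sketch: clear the prefix cache on a running vLLM server.
# Useful after swapping LoRA adapters or when benchmarking cold-cache latency.
curl -X POST http://localhost:8000/reset_prefix_cache
```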

🐛 Bug Fixes

  • Fixed DeepSeek V3.2 top-k logic and drop_thinking behavior.
  • Resolved Medusa GPU-CPU synchronization issues to avoid blocking.
  • Fixed Anthropic API streaming response issues.
  • Restored MoE + GGUF support for Qwen2 and Qwen3 MoE models.
  • Security fix for CVE-2025-62164.
  • Fixed Triton ScaledMM fallback for AMD ROCm.

🔧 Affected Symbols

AttentionConfig, VLLM_ATTENTION_BACKEND, PassConfig, ModelConfig, embed_input_ids, embed_multimodal, selective_state_update, compile_ranges, encoding_format

⚡ Deprecations

  • merge_by_field_config is now deprecated.
  • The --convert reward flag is deprecated in favor of --convert embed.
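The `--convert` change can be sketched as follows; the model name is a placeholder, and the flag still works in this release but emits a deprecation warning:

```shell
# Deprecated in v0.13.0:
#   vllm serve my-reward-model --convert reward

# Preferred going forward:
vllm serve my-reward-model --convert embed
```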