
v0.13.0

Breaking Changes
📦 vllm — 7 breaking changes · 10 features · 6 fixes · 2 deprecations · 9 affected symbols

Summary

vLLM v0.13.0 introduces support for NVIDIA Blackwell Ultra and DeepSeek-V3.2, alongside a major performance overhaul for Whisper models. This release transitions attention configuration from environment variables to CLI arguments and includes significant core engine optimizations like Model Runner V2.

⚠️ Breaking Changes

  • PassConfig flags have been renamed per RFC #27995.
  • The environment variable VLLM_ATTENTION_BACKEND has been removed; use the --attention-backend CLI argument instead.
  • The -O.xx flag has been removed.
  • Deprecated plugin and compilation fields have been removed.
  • Deprecated task, seed, and Multi-Modal (MM) settings have been removed.
  • Removed embed_input_ids and embed_multimodal fallback mechanisms.
  • The tokenizer setter has been removed.

Migration Steps

  1. Update deployment scripts to replace VLLM_ATTENTION_BACKEND environment variable with --attention-backend CLI flag.
  2. Review and update PassConfig flag names in custom configurations to match RFC #27995.
  3. Replace usage of --convert reward with --convert embed.
  4. Remove any usage of the deprecated -O.xx optimization flags.
  5. Ensure tokenizer initialization does not rely on the removed tokenizer setter.
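Step 1 above is the most common change for deployments. A minimal sketch of the migration, assuming a typical `vllm serve` invocation (the model name and backend value here are illustrative examples, not taken from the release notes):

```shell
# Before (removed in v0.13.0): backend selected via environment variable
#   VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve meta-llama/Llama-3.1-8B-Instruct

# After: pass the backend as a CLI argument instead
vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend FLASH_ATTN
```

Configuration-management tooling (systemd units, Helm charts, Docker entrypoints) that injects `VLLM_ATTENTION_BACKEND` should be updated in the same way, since the environment variable is now silently ignored or rejected.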

✨ New Features

  • Support for new models: BAGEL, AudioFlamingo3, JAIS 2, and latent MoE architectures.
  • Added tool parsers for DeepSeek-V3.2, Gigachat 3, and Holo2 reasoning.
  • NVIDIA Blackwell Ultra (SM103/GB300) support with CUDA 13.
  • Whisper model performance overhaul (~3x speedup) with CPU backend support.
  • Introduction of Model Runner V2 with min-p sampling and NaN detection.
  • New MCP (Model Context Protocol) infrastructure for tool use and browser/container integration.
  • Conditional compilation via compile_ranges for selective kernel builds.
  • Support for xxHash high-performance hashing in prefix caching.
  • Multi-vector retrieval API and binary format support for embeddings.
  • Mooncake Transfer Engine for KV connectors and /reset_prefix_cache API.
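For the `/reset_prefix_cache` API mentioned above, a minimal sketch of invoking it against a running server, assuming the default OpenAI-compatible server on port 8000 (the host and port are assumptions, not from the release notes):

```shell
# Hedged sketch: clear the prefix cache on a running vLLM server.
# Useful after swapping LoRA adapters or when benchmarking cold-cache latency.
curl -X POST http://localhost:8000/reset_prefix_cache
```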

🐛 Bug Fixes

  • Fixed DeepSeek V3.2 top-k logic and drop_thinking behavior.
  • Resolved Medusa GPU-CPU synchronization issues to avoid blocking.
  • Fixed Anthropic API streaming response issues.
  • Restored MoE + GGUF support for Qwen2 and Qwen3 MoE models.
  • Security fix for CVE-2025-62164.
  • Fixed Triton ScaledMM fallback for AMD ROCm.

🔧 Affected Symbols

AttentionConfig, VLLM_ATTENTION_BACKEND, PassConfig, ModelConfig, embed_input_ids, embed_multimodal, selective_state_update, compile_ranges, encoding_format

⚡ Deprecations

  • merge_by_field_config is now deprecated.
  • The --convert reward flag is deprecated in favor of --convert embed.
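The `--convert` change can be sketched as follows; the model name is a placeholder, and the flag still works in this release but emits a deprecation warning:

```shell
# Deprecated in v0.13.0:
#   vllm serve my-reward-model --convert reward

# Preferred going forward:
vllm serve my-reward-model --convert embed
```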