Changelog

v0.12.0

📦 vllm · 6 breaking changes · 8 new features · 5 bug fixes · 7 deprecations · 8 affected symbols

Summary

vLLM v0.12.0 introduces a major architectural shift with GPU Model Runner V2 and PyTorch 2.9.0 integration. It delivers significant performance gains (up to 18% higher throughput) and expands support for DeepSeek-V3, multimodal models, and AMD hardware.

⚠️ Breaking Changes

  • The PyTorch upgrade to 2.9.0 requires a CUDA 12.9 environment.
  • Removed 'num_lookahead_slots' parameter.
  • Removed 'best_of' parameter.
  • Removed LoRA extra vocab support.
  • Mistral format is now auto-detected during model loading.
  • Online quantization logic moved to 'model.load_weights'.

Migration Steps

  1. Update the host environment to CUDA 12.9.
  2. Remove the 'best_of' and 'num_lookahead_slots' parameters from API calls; both have been removed (see the sketch after this list).
  3. Update LoRA configurations to remove extra vocab dependencies.
  4. Transition from the deprecated 'xformers' backend to a supported alternative such as FlashInfer or Triton.
  5. Update GGUF loading code to use the new 'repo_id:quant_type' syntax (also shown in the sketch below).
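
A minimal before/after sketch of steps 2 and 5, assuming the standard vllm Python entrypoints; the GGUF repo and quant-type strings are illustrative placeholders, not tested values.

```python
from vllm import LLM, SamplingParams

# Before v0.12.0: 'best_of' could be passed alongside 'n'.
# params = SamplingParams(n=1, best_of=4, temperature=0.8)  # removed in v0.12.0

# After v0.12.0: drop 'best_of'; request multiple candidates via 'n' instead.
params = SamplingParams(n=4, temperature=0.8)

# GGUF loading with the new 'repo_id:quant_type' syntax.
# The repo and quant type below are hypothetical placeholders.
llm = LLM(model="TheBloke/Llama-2-7B-GGUF:Q4_K_M")
outputs = llm.generate(["Hello, world!"], params)
```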

✨ New Features

  • GPU Model Runner V2 (Experimental): Refactored execution pipeline with persistent block tables and Triton-native sampler.
  • EAGLE Speculative Decoding: Support for multi-step CUDA graphs, DP>1, and multimodal models.
  • Prefill Context Parallel (PCP): Partitions sequence dimension during prefill for long-sequence inference.
  • AMD ROCm Expansion: Support for DeepSeek v3.2, SparseMLA, and FP8 MLA decode.
  • New Model Support: PLaMo-3, OpenCUA-7B, HunyuanOCR, Mistral Large 3, and Gemma3 GGUF.
  • RLHF Support: Pause and resume generation for asynchronous RL training.
  • Audio Support: Audio embeddings in chat completions and Qwen3 Omni audio-in-video.
  • Optimization Levels: Added -O0 through -O3 flags to trade startup time for performance (see the sketch after this list).
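
A short sketch of selecting an optimization level from the Python API, assuming the LLM constructor's compilation_config argument accepts an integer that mirrors the new -O flags; the model name is a placeholder.

```python
from vllm import LLM

# -O0 = no compilation (fastest startup) ... -O3 = full optimization
# (slowest startup, best steady-state performance).
# Assumed CLI equivalent: vllm serve <model> -O3
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    compilation_config=3,  # assumed to map to the -O3 level
)
```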

🐛 Bug Fixes

  • Fixed QwenVL cos/sin cache optimization.
  • Removed the -1 temperature hack in the Triton-native sampler.
  • Relaxed the DeepGEMM N-dimension restriction from a multiple of 128 to a multiple of 64.
  • Improved H200 Fused MoE configuration.
  • Reduced Docker image size by ~200MB.

🔧 Affected Symbols

GPUModelRunnerV2, ParallelConfig, CompilationConfig.use_inductor, SamplingParams, model.load_weights, AiterFlashAttentionBackend, FusedMoE, ToolServer

⚡ Deprecations

  • The 'xformers' backend is deprecated.
  • Setting 'seed=None' is deprecated; pass an explicit seed instead (see the sketch after this list).
  • EPLB fields nested directly under ParallelConfig (scheduled for removal).
  • guided_* config fields (scheduled for removal).
  • override_pooler_config and disable_log_requests (scheduled for removal).
  • CompilationConfig.use_inductor (scheduled for removal).
  • Old metrics (scheduled for removal).
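
A minimal sketch of the seed change, assuming the deprecation targets the per-request sampling seed; pass an explicit integer (or omit the argument entirely) rather than seed=None.

```python
from vllm import SamplingParams

# Deprecated: explicitly passing seed=None.
# params = SamplingParams(temperature=0.8, seed=None)

# Preferred: an explicit integer seed for reproducible sampling,
# or simply omit the argument and accept the default behavior.
params = SamplingParams(temperature=0.8, seed=1234)
```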