Change8

v0.16.0

📦 vllm
2 breaking · 51 features · 31 fixes · 2 deprecations · 29 affected symbols

Summary

vLLM v0.16.0 introduces full support for async scheduling combined with pipeline parallelism, a new Realtime WebSocket API, and a major overhaul of XPU platform support that deprecates IPEX in favor of vllm-xpu-kernels. The release also adds extensive model support and performance optimizations across hardware platforms.

⚠️ Breaking Changes

  • vLLM now builds against PyTorch 2.10; environments must be updated to PyTorch 2.10 or newer.
  • IPEX is deprecated in favor of vllm-xpu-kernels for XPU platform support.

Migration Steps

  1. Update your environment to use PyTorch 2.10 or newer.
  2. If using XPU platforms, replace configurations relying on IPEX with those using vllm-xpu-kernels.
  3. Remove dependencies or configurations related to the removed BitBlas and Marlin 24 quantization backends.
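As a quick pre-flight check before step 1, a version string can be compared against the new floor. This helper is illustrative and not part of vLLM; it only assumes a `major.minor[.patch][+local]` version format such as the one `torch.__version__` reports:

```python
def meets_min_version(version: str, minimum: tuple[int, int]) -> bool:
    """Check whether a 'major.minor[.patch][+local]' version string meets a floor."""
    core = version.split("+")[0]                    # drop a local tag such as '+cu128'
    major, minor = (int(p) for p in core.split(".")[:2])
    return (major, minor) >= minimum

# vLLM v0.16.0 requires PyTorch 2.10 or newer:
print(meets_min_version("2.9.1+cu128", (2, 10)))   # False: upgrade first
print(meets_min_version("2.10.0", (2, 10)))        # True
```

In practice you would pass `torch.__version__` rather than a literal string; the point is that `2.9.x` installs must be upgraded before moving to v0.16.0.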

✨ New Features

  • Full support for Async scheduling combined with Pipeline Parallelism, yielding significant throughput improvements.
  • Introduction of a new WebSocket-based Realtime API for streaming audio interactions.
  • Native NCCL-based weight syncing API for RLHF workflows.
  • Layerwise weight reloading support for QeRL.
  • Engine pause/resume functionality with request preservation.
  • Unified Parallel Drafting implementation for speculative decoding.
  • Speculative decoding now supports structured outputs.
  • Support for penalty application in Model Runner V2 for speculative decoding.
  • XPU platform overhaul including support for MoE, MXFP4 MoE, WNA16, scaled_mm, and FP8 MoE using vllm-xpu-kernels.
  • Support for new model architectures including GLM-OCR with MTP, Qwen3-ASR, DeepSeek-OCR-2, Intern-S1-Pro, MiniCPM-o 4.5, openPangu7B-VL, NemotronHPuzzle heterogeneous, MusicFlamingo, FunAudioChat, ColBERT late interaction, voyage-4-nano, and GLM-5.
  • Speculative decoding support added for EAGLE3 (Hunyuan/HunyuanVL), AFMoE, and Mistral3.
  • LoRA expansion for Gemma3 vision components, Nemotron-H MTP models, and Qwen3 output embedding, along with optimized fused MoE-LoRA kernels.
  • Qwen3-Omni transcription support.
  • Mistral Large 3 support with FlashInfer MoE.
  • LFM2 SigLIP2 intermediate encoder layer support.
  • Support for embedding input for disabled modalities.
  • FlashInfer TRTLLM BF16 MoE integration for NVIDIA.
  • SM100 INT4 W4A16 kernel support.
  • SM121 (DGX Spark) CUTLASS support.
  • MNNVL protocol support for GB series.
  • FlashInfer MLA concat optimization.
  • GDN attention layout optimization.
  • DeepGEMM FP8 MLA performance improvements.
  • wvSplitK_fp8 performance improvements.
  • B200 MoE configs for Nemotron Nano and Nemotron Super, including B200 TP2 support.
  • Mamba selective scan tuning for B200.
  • Qwen3-Next FP8 tunings and AITER attention backend for AMD ROCm.
  • Fused_add_rmsnorm_pad for GPT-OSS on AMD.
  • ARM CPU support for KleidiAI INT4 dynamic quant with BF16 activations, NEON BFMMLA BF16 paged attention, and vectorization backend optimization.
  • BF16 kernel type support for IBM Z (s390x).
  • torch.compile improvements including stopping compilation of identical artifacts and MoE cold start optimization.
  • Chat completion streaming optimization.
  • ORJSONResponse for faster API responses.
  • MoE permute optimization for CUTLASS FP8.
  • Shared/routed overlap optimization for latent MoE on Nemotron-H.
  • FlashInfer autotune control flag.
  • Disaggregated serving improvements including Mooncake connector rework and cross-layer KV cache layout.
  • EPLB improvements including capturing logical experts with router replay.
  • New quantization methods: FP8 block quant for CompressedTensorsW8A16Fp8, ModelOpt MXFP8 for dense models, NVFP4/FP8 on Turing GPUs, and TP > 4 for FP4 Gemm.
  • WebSocket-based streaming API implementation.
  • Responses API enhancements to return sampling parameters, token IDs, and prompt token IDs.
  • Pooling API standardization.
  • Tool calling fixes and GLM-4 incremental string streaming.
  • Structured outputs performance optimization with reasoning.
  • CLI option `--disable-access-log-for-endpoints`.
  • UX improvements: Nested configs in YAML, GGUF repo_id:quant_type syntax, default DeepSeek ReasoningParser thinking, removal of noisy CT warning, early tokenization validation, and reasoning_content backward compatibility.
  • run_batch transcription/translation support.
  • /server_info collect_env endpoint.
  • OTEL tracing during model loading.
  • Ability to clear MM and encoder cache.
  • HF Hub LoRA resolver.
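Of the UX items above, nested YAML configs are the easiest to illustrate. The sketch below is hypothetical — the key names and values are placeholders, and the exact schema should be checked against the v0.16.0 docs — but the idea is that options previously passed as flat flags or JSON strings can now be written as nested mappings in the file given to `vllm serve --config`:

```yaml
# Hypothetical vllm serve --config example with nested sections
# (key names and values are illustrative; consult the release docs for the schema)
model: org/model-name
tensor-parallel-size: 2
speculative-config:        # previously supplied as a JSON string on the CLI
  method: eagle3
  num-speculative-tokens: 4
```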

🐛 Bug Fixes

  • Deadlock fix for torchrun PP broadcast.
  • Correctness fix for spec tokens with prefill chunks in speculative decoding.
  • Fix for GLM-4.7-GPTQ decode and MTP acceptance rate regression.
  • DeepSeek V3.2 fast detokenization and tokenizer fix.
  • GLM-5 MTP accuracy fix.
  • Fix for CPU memory leak from Request reference cycle in prefix caching.
  • Fix for DeepSeek R1 CUTLASS MLA on B200.
  • QK Norm+RoPE fusion fix on B200+FP8.
  • Fix for CUTLASS FP8 blockwise on SM103a.
  • Qwen3-Omni startup fix on ROCm.
  • Fix for 32-bit indexing assumption in torch.compile.
  • Fix for attention fusion pass in torch.compile.
  • Fix for grammar bitmask H2D copy on separate stream.
  • Fix for early-reject oversized MM requests.
  • Fix for multi-turn tool call ID preservation.
  • Fix for tool calling indexing double-counting.
  • Fix for DSV3.2 fast detokenization in tool calling.
  • Fix for MCP tools non-streaming.
  • Fix for structured outputs guidance vocab size.
  • Fix for multi-document scoring returning single result.
  • Fixes for FP8 online quantization memory.
  • Fixes for asymmetric W4A16 (ConchLinear) for CT.
  • Fixes for DeepSeek V3.2 NVFP4 quantization.
  • Fixes for LoRA FP8 quantization.
  • Fix for quantized Falcon-H1 model loading.
  • Fix for quantized Mamba TP with n_groups=1.
  • Fixes for CPU W8A8 with bias and 3D input support.
  • Fix for Ray multi-replica single-instance issue in disaggregated serving.
  • DP metadata fix for dense models in EPLB.
  • Fixes for async double-free in disaggregated serving.
  • Fixes for Qwen3-Omni/GLM-4.xV MRoPE positioning.


⚡ Deprecations

  • BitBlas quantization backend has been removed.
  • Marlin 24 quantization backend has been removed.