v0.16.0
📦 vllm
⚠ 2 breaking · ✨ 51 features · 🐛 31 fixes · ⚡ 2 deprecations · 🔧 29 symbols
Summary
vLLM v0.16.0 introduces full support for async scheduling with pipeline parallelism, a new Realtime WebSocket API, and a major overhaul of XPU platform support that deprecates IPEX in favor of vllm-xpu-kernels. The release also adds extensive new model support and performance optimizations across hardware platforms.
⚠️ Breaking Changes
- vLLM now requires PyTorch 2.10 or newer; environments pinned to an older PyTorch must be upgraded.
- IPEX is deprecated in favor of vllm-xpu-kernels for XPU platform support.
Migration Steps
- Update your environment to use PyTorch 2.10 or newer.
- If using XPU platforms, replace configurations relying on IPEX with those using vllm-xpu-kernels.
- Remove dependencies or configurations related to the removed BitBlas and Marlin 24 quantization backends.
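For the first migration step, a minimal version guard can fail fast before the engine starts. The `meets_minimum` helper below is illustrative, not part of vLLM:

```python
def meets_minimum(version: str, minimum: tuple = (2, 10)) -> bool:
    """Return True if a 'major.minor[.patch][+local]' version string
    satisfies the minimum (major, minor) pair, e.g. PyTorch >= 2.10."""
    # Strip any local build tag such as '+cu128' before parsing.
    core = version.split("+")[0]
    parts = tuple(int(p) for p in core.split(".")[:2])
    return parts >= minimum

# Example: guard an environment before loading vLLM.
assert meets_minimum("2.10.0")      # new enough
assert not meets_minimum("2.9.1")   # must be upgraded first
```

Note that a plain string comparison would get this wrong (`"2.9" > "2.10"` lexically), which is why the helper compares numeric tuples.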
✨ New Features
- Full support for Async scheduling combined with Pipeline Parallelism, yielding significant throughput improvements.
- Introduction of a new WebSocket-based Realtime API for streaming audio interactions.
- Native NCCL-based weight syncing API for RLHF workflows.
- Layerwise weight reloading support for QeRL.
- Engine pause/resume functionality with request preservation.
- Unified Parallel Drafting implementation for speculative decoding.
- Speculative decoding now supports structured outputs.
- Support for penalty application in Model Runner V2 for speculative decoding.
- XPU platform overhaul including support for MoE, MXFP4 MoE, WNA16, scaled_mm, and FP8 MoE using vllm-xpu-kernels.
- Support for new model architectures including GLM-OCR with MTP, Qwen3-ASR, DeepSeek-OCR-2, Intern-S1-Pro, MiniCPM-o 4.5, openPangu7B-VL, NemotronHPuzzle heterogeneous, MusicFlamingo, FunAudioChat, ColBERT late interaction, voyage-4-nano, and GLM-5.
- Speculative decoding support added for EAGLE3 (Hunyuan/HunyuanVL), AFMoE, and Mistral3.
- LoRA expansion for Gemma3 vision components, Nemotron-H MTP models, and Qwen3 output embedding, along with optimized fused MoE-LoRA kernels.
- Qwen3-Omni transcription support.
- Mistral Large 3 support with FlashInfer MoE.
- LFM2 SigLIP2 intermediate encoder layer support.
- Support for embedding input for disabled modalities.
- FlashInfer TRTLLM BF16 MoE integration for NVIDIA.
- SM100 INT4 W4A16 kernel support.
- SM121 (DGX Spark) CUTLASS support.
- MNNVL protocol support for GB series.
- FlashInfer MLA concat optimization.
- GDN attention layout optimization.
- DeepGEMM FP8 MLA performance improvements.
- wvSplitK_fp8 performance improvements.
- B200 MoE configs for Nemotron Nano and Super B200 TP2 support.
- Mamba selective scan tuning for B200.
- Qwen3-Next FP8 tunings and an AITER attention backend for AMD ROCm.
- `fused_add_rmsnorm_pad` kernel for GPT-OSS on AMD.
- ARM CPU support for KleidiAI INT4 dynamic quant with BF16 activations, NEON BFMMLA BF16 paged attention, and vectorization backend optimization.
- BF16 kernel type support for IBM Z (s390x).
- torch.compile improvements including stopping compilation of identical artifacts and MoE cold start optimization.
- Chat completion streaming optimization.
- ORJSONResponse for faster API responses.
- MoE permute optimization for CUTLASS FP8.
- Shared/routed overlap optimization for latent MoE on Nemotron-H.
- FlashInfer autotune control flag.
- Disaggregated serving improvements including Mooncake connector rework and cross-layer KV cache layout.
- EPLB improvements including capturing logical experts with router replay.
- New quantization methods: FP8 block quant for CompressedTensorsW8A16Fp8, ModelOpt MXFP8 for dense models, NVFP4/FP8 on Turing GPUs, and TP > 4 for FP4 Gemm.
- WebSocket-based streaming API implementation.
- Responses API enhancements to return sampling parameters, token IDs, and prompt token IDs.
- Pooling API standardization.
- Tool calling fixes and GLM-4 incremental string streaming.
- Structured outputs performance optimization with reasoning.
- CLI option `--disable-access-log-for-endpoints`.
- UX improvements: Nested configs in YAML, GGUF repo_id:quant_type syntax, default DeepSeek ReasoningParser thinking, removal of noisy CT warning, early tokenization validation, and reasoning_content backward compatibility.
- run_batch transcription/translation support.
- /server_info collect_env endpoint.
- OTEL tracing during model loading.
- Ability to clear MM and encoder cache.
- HF Hub LoRA resolver.
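The Realtime API above is WebSocket-based and streams audio as JSON text frames. The exact wire format is not documented in these notes, so the event name used below (`input_audio_buffer.append`) is an assumption modeled on common realtime-audio protocols, not a confirmed vLLM schema:

```python
import base64
import json

def audio_append_event(pcm_bytes: bytes) -> str:
    """Frame a chunk of raw audio as a JSON text message for a
    realtime WebSocket session. The 'type' value is illustrative."""
    return json.dumps({
        "type": "input_audio_buffer.append",  # assumed event name
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

# Round-trip check: the audio payload survives framing intact.
frame = audio_append_event(b"\x00\x01\x02\x03")
decoded = json.loads(frame)
assert base64.b64decode(decoded["audio"]) == b"\x00\x01\x02\x03"
```

A client would send such frames over a WebSocket connection to the server's realtime endpoint; consult the release documentation for the actual event schema and endpoint path.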
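The new GGUF `repo_id:quant_type` shorthand mentioned in the UX improvements splits into two parts at the last colon. This tiny parser is a sketch of the syntax as described above, not vLLM's own loader code:

```python
def split_gguf_spec(spec: str):
    """Split 'org/repo:Q4_K_M' into (repo_id, quant_type).
    A spec without ':' returns (spec, None). Illustrative only."""
    repo_id, sep, quant = spec.rpartition(":")
    if not sep:
        # No quant type given; caller falls back to a default.
        return spec, None
    return repo_id, quant

assert split_gguf_spec("TheBloke/Llama-2-7B-GGUF:Q4_K_M") == (
    "TheBloke/Llama-2-7B-GGUF", "Q4_K_M")
assert split_gguf_spec("org/repo") == ("org/repo", None)
```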
🐛 Bug Fixes
- Deadlock fix for torchrun PP broadcast.
- Correctness fix for spec tokens with prefill chunks in speculative decoding.
- Fix for GLM-4.7-GPTQ decode and MTP acceptance rate regression.
- DeepSeek V3.2 fast detokenization and tokenizer fix.
- GLM-5 MTP accuracy fix.
- Fix for CPU memory leak from Request reference cycle in prefix caching.
- Fix for DeepSeek R1 CUTLASS MLA on B200.
- QK Norm+RoPE fusion fix on B200+FP8.
- Fix for CUTLASS FP8 blockwise on SM103a.
- Qwen3-Omni startup fix on ROCm.
- Fix for 32-bit indexing assumption in torch.compile.
- Fix for attention fusion pass in torch.compile.
- Fix for grammar bitmask H2D copy on separate stream.
- Fix for early-reject oversized MM requests.
- Fix for multi-turn tool call ID preservation.
- Fix for tool calling indexing double-counting.
- Fix for DeepSeek V3.2 fast detokenization in tool calling.
- Fix for MCP tools non-streaming.
- Fix for structured outputs guidance vocab size.
- Fix for multi-document scoring returning single result.
- Fixes for FP8 online quantization memory.
- Fixes for asymmetric W4A16 (ConchLinear) for CT.
- Fixes for DeepSeek V3.2 NVFP4 quantization.
- Fixes for LoRA FP8 quantization.
- Fix for quantized Falcon-H1 model loading.
- Fix for quantized Mamba TP with n_groups=1.
- Fixes for CPU W8A8 with bias and 3D input support.
- Fix for Ray multi-replica single-instance issue in disaggregated serving.
- DP metadata fix for dense models in EPLB.
- Fixes for async double-free in disaggregated serving.
- Fixes for Qwen3-Omni/GLM-4.xV MRoPE positioning.
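The CPU memory leak fix above involved a `Request` reference cycle keeping objects alive in the prefix cache. A generic illustration of the pattern, using a weak reference to break the cycle (class names here are illustrative, not vLLM internals):

```python
import weakref

class Cache:
    """Toy cache that records requests by id (illustrative only)."""
    def __init__(self):
        self.requests = {}

class Request:
    def __init__(self, req_id, cache):
        self.req_id = req_id
        # Strong references in both directions (request -> cache and
        # cache -> request) would form a cycle that is only reclaimed
        # by the cyclic GC; weak references let objects die promptly.
        self._cache = weakref.ref(cache)
        cache.requests[req_id] = weakref.ref(self)

cache = Cache()
req = Request("r1", cache)
assert cache.requests["r1"]() is req

del req
# With only weak references left, the Request is freed immediately
# and the cache entry dereferences to None.
assert cache.requests["r1"]() is None
```

The design choice is that the cache never extends a request's lifetime; stale entries can then be swept lazily by checking for dead weak references.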
Affected Symbols
IPEX, BitBlas, Marlin 24, torchrun, Model Runner V2, ConfigManager, KernelWrapper, KernelRegistry, PluggableLayer, Cascade Attention, Triton attention, FlashInfer, TRTLLM, CUTLASS, DeepGEMM, wvSplitK_fp8, AITER attention backend, fused_add_rmsnorm_pad, KleidiAI INT4 dynamic quant, NEON BFMMLA BF16 paged attention, torch.compile, Mooncake connector, NIXL Connector V2, EPLB, CompressedTensorsW8A16Fp8, ModelOpt MXFP8, ScoreRequest, DeepSeek ReasoningParser, Authorization header handling
⚡ Deprecations
- The BitBlas quantization backend has been removed.
- The Marlin 24 quantization backend has been removed.