
v0.15.0

📦 vllm
⚠️ 2 breaking · ✨ 50 features · 🐛 7 fixes · ⚡ 1 deprecation · 🔧 18 symbols

Summary

This release introduces extensive model support, significant performance enhancements across NVIDIA and AMD hardware (especially for MoE and FP4), and new API features like session-based streaming input. Several deprecated metrics and quantization methods have been removed.

⚠️ Breaking Changes

  • Removed the deprecated metric `vllm:time_per_output_token_seconds`; use `vllm:inter_token_latency_seconds` instead.
  • Removed the DeepSpeedFp8 and RTN quantization methods.
  • Removed deprecated environment variables.

Migration Steps

  1. Replace usage of the deprecated metric `vllm:time_per_output_token_seconds` with `vllm:inter_token_latency_seconds`.
  2. If using DeepSpeedFp8 quantization, migrate to an alternative method as it has been removed.
  3. If using RTN quantization, migrate to an alternative method as it has been removed.
  4. If using HQQ quantization, migrate to an alternative method as it is now deprecated.
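For step 1, the rename is mechanical: any dashboard, alert rule, or recorded query that references the old metric name (including its `_bucket`/`_sum`/`_count` histogram series) just needs the name swapped. A minimal sketch of such a migration helper; the example query string is illustrative, not taken from any particular dashboard:

```python
# Rewrite stored PromQL queries after the v0.15.0 metric rename.
# The mapping below reflects the rename in these release notes.
RENAMED_METRICS = {
    "vllm:time_per_output_token_seconds": "vllm:inter_token_latency_seconds",
}

def migrate_query(query: str) -> str:
    """Return the query with removed vLLM metric names replaced.

    Plain substring replacement also covers derived histogram series
    such as `..._bucket`, `..._sum`, and `..._count`.
    """
    for old, new in RENAMED_METRICS.items():
        query = query.replace(old, new)
    return query

if __name__ == "__main__":
    old = "histogram_quantile(0.95, rate(vllm:time_per_output_token_seconds_bucket[5m]))"
    print(migrate_query(old))
```

Running the same helper over alert-rule files catches stragglers that would otherwise silently stop matching after the upgrade.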

✨ New Features

  • Added support for Kimi-K2.5, Molmo2, Step3vl 10B, Step1, GLM-Lite, and Eagle2.5-8B VLM model architectures.
  • Added LoRA support for Nemotron-H, InternVL2, and MiniMax M2.
  • Enabled speculative decoding for EAGLE3 (Pixtral/LlavaForConditionalGeneration), Qwen3 VL MoE, and added general draft model support.
  • Introduced BGE-M3 sparse embeddings and ColBERT embeddings support.
  • Implemented Voxtral streaming architecture enhancements.
  • Added SharedFusedMoE support for Qwen3MoE.
  • Enabled dynamic resolution for Nemotron Nano VL.
  • Enabled Molmo2 vision backbone quantization.
  • Enabled `--async-scheduling` to work concurrently with pipeline parallelism.
  • Implemented Mamba prefix caching using `--enable-prefix-caching --mamba-cache-mode align` for Mamba/hybrid models, yielding ~2x speedup.
  • Introduced session-based streaming input, accepting async generators producing `StreamingInput` objects for interactive workloads like ASR.
  • Model Runner V2 now supports VLM.
  • Implemented inplace loading for LoRA for improved memory efficiency.
  • Added support for torch.compile inductor artifacts via AOT compilation.
  • FlashInfer MLA is now the default MLA backend on Blackwell GPUs, with TRTLLM as the default prefill backend.
  • Implemented grouped topk kernel fusion for MoE, resulting in 1.2-2% E2E throughput improvement.
  • Improved NVFP4 small-batch decoding performance.
  • Enabled faster cold start for MoEs when using torch.compile.
  • Optimized FP4 quantization on Blackwell (SM100F) using 256-bit loads, leading to up to 65% speedup and ~4% E2E throughput improvement.
  • Added topk_sigmoid kernel for MoE routing.
  • Added atomics reduce counting for SplitK skinny GEMMs.
  • Fused cat+quant operation for FP8 KV cache in MLA.
  • Enabled SiluAndMul and QuantFP8 CustomOp compilation via torch.compile.
  • Improved Triton prefill attention performance via torch.compile.
  • Added MoRI EP (Expert Parallel all2all backend) for AMD ROCm.
  • Improved AMD ROCm attention via Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend.
  • Enabled FP4 MLA projection GEMMs with dynamic quantization on AMD.
  • Enabled Flash Attention Triton backend on AMD RDNA3/RDNA4 consumer GPUs.
  • Added pipeline parallelism support for TPUs.
  • Added backend option for TPU execution.
  • Implemented AgRsAll2AllManager for distributed communication on Intel XPU.
  • Enabled NUMA-aware acceleration for TP/DP inference on ARM CPUs.
  • Added torch.compile support for Whisper models.
  • Fixed platform compatibility issue for Windows Subsystem for Linux (WSL).
  • Enabled W4A16 support for compressed-tensors MoE models using MXFP4.
  • Added quantization support (Marlin, NVFP4 CUTLASS, FP8, INT8, compressed-tensors) for non-gated MoE models.
  • Integrated Quantization Toolkit for Intel platforms.
  • Enabled per-tensor and per-attention-head FP8 KV cache quantization via llmcompressor.
  • Responses API now supports partial message generation.
  • Added `include_stop_str_in_output` tuning parameter to Responses API.
  • Added `prompt_cache_key` support to Responses API.
  • OpenAI API now supports `skip_special_tokens` configuration.
  • Score endpoint now supports flexible input formats using `data_1`/`data_2` and `queries`/`documents`.
  • Added new Render endpoints for prompt preprocessing.
  • Whisper API now returns `avg_logprob` and `compression_ratio` in verbose_json segments.
  • Added FIPS 140-3 compliant hash option for security.
  • Added `--ssl-ciphers` CLI argument for security configuration.
  • Implemented auto detection of `api_server_count` based on `dp_size`.
  • Enabled wheel variant auto-detection during installation.
  • Allowed custom profiler URI schemes.
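The session-based streaming input feature above accepts an async generator that yields `StreamingInput` objects, which suits incremental workloads such as ASR where the prompt arrives piece by piece. A hedged sketch of the shape of such a generator; `StreamingInput` is named in the release notes, but the stand-in dataclass and everything around it are illustrative assumptions, not the confirmed vLLM API:

```python
# Sketch only: StreamingInput here is a local stand-in for vLLM's type,
# and draining the generator stands in for handing it to the engine's
# session-based streaming entry point.
import asyncio
from dataclasses import dataclass

@dataclass
class StreamingInput:
    chunk: str  # e.g. a transcript fragment produced by an ASR frontend

async def transcript_chunks():
    """Illustrative async generator yielding StreamingInput objects."""
    for piece in ["hello ", "streaming ", "world"]:
        await asyncio.sleep(0)  # real code would await audio/transcript I/O
        yield StreamingInput(chunk=piece)

async def main() -> str:
    # Real usage would pass the generator to the engine; here we drain it
    # to show the producer side of the contract.
    received = [inp.chunk async for inp in transcript_chunks()]
    text = "".join(received)
    print(text)
    return text

if __name__ == "__main__":
    asyncio.run(main())
```

The key design point is that the producer never blocks the event loop: each chunk is awaited, so the engine can interleave generation with input arrival.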
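The more flexible score-endpoint inputs listed above can be illustrated with the two request shapes. The field names `data_1`/`data_2` and `queries`/`documents` come from the release notes; the surrounding payload structure and the model name are illustrative assumptions, not a confirmed schema:

```python
# Hedged sketch of the two input shapes the score endpoint now accepts.

def build_pairwise_payload(model: str, left: str, right: str) -> dict:
    """One plausible shape for scoring a pair via data_1/data_2."""
    return {"model": model, "data_1": left, "data_2": right}

def build_batch_payload(model: str, queries: list, documents: list) -> dict:
    """One plausible shape for scoring via queries/documents."""
    return {"model": model, "queries": queries, "documents": documents}

if __name__ == "__main__":
    # Model name is a placeholder for any reranker/scoring model.
    print(build_pairwise_payload("my-reranker", "what is vllm?",
                                 "vLLM is an inference engine."))
    print(build_batch_payload("my-reranker",
                              ["what is vllm?"],
                              ["vLLM is an inference engine.",
                               "Unrelated document."]))
```

Either dict would then be POSTed as the JSON body of a score request; check the server's API reference for the exact accepted schema before relying on these field combinations.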

🐛 Bug Fixes

  • Fixed configuration issue in speculative decoding for Eagle draft_model_config.
  • Fixed incompatible scale shapes when using DeepSeek-V3.1 with DeepGEMM.
  • Fixed DP+MoE inference issue using CpuCommunicator.
  • Fixed P/D inference issue when models are not using MoE with DP.
  • Fixed possible deadlock issue in EPLB.
  • Fixed UCX memory leak on NIXL by exporting UCX_MEM_MMAP_HOOK_MODE=none.
  • Fixed byte fallback handling for structured output using outlines.


⚡ Deprecations

  • HQQ quantization method is deprecated.