vllm v0.15.0
⚠ 2 breaking · ✨ 50 features · 🐛 7 fixes · ⚡ 1 deprecation · 🔧 18 symbols
Summary
This release adds support for many new model architectures, delivers significant performance gains on NVIDIA and AMD hardware (especially for MoE and FP4 workloads), and introduces new API features such as session-based streaming input. Several deprecated metrics and quantization methods have been removed.
⚠️ Breaking Changes
- Removed deprecated metric `vllm:time_per_output_token_seconds`; users must switch to `vllm:inter_token_latency_seconds`.
- Removed deprecated environment variables.
Migration Steps
- Replace usage of the deprecated metric `vllm:time_per_output_token_seconds` with `vllm:inter_token_latency_seconds`.
- If using DeepSpeedFp8 quantization, migrate to an alternative method as it has been removed.
- If using RTN quantization, migrate to an alternative method as it has been removed.
- If using HQQ quantization, migrate to an alternative method as it is now deprecated.
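The metric rename above can be applied mechanically to existing dashboards and alert rules. A minimal sketch in Python (the PromQL expression below is illustrative only, not taken from a real dashboard):

```python
# Mechanical migration of the removed metric name, e.g. for Grafana
# dashboard JSON or Prometheus rule files stored as text.
OLD_METRIC = "vllm:time_per_output_token_seconds"
NEW_METRIC = "vllm:inter_token_latency_seconds"

def migrate_expr(expr: str) -> str:
    """Rewrite a PromQL expression to use the replacement metric."""
    return expr.replace(OLD_METRIC, NEW_METRIC)

query = "histogram_quantile(0.9, rate(vllm:time_per_output_token_seconds_bucket[5m]))"
print(migrate_expr(query))
# → histogram_quantile(0.9, rate(vllm:inter_token_latency_seconds_bucket[5m]))
```

Because the suffixed series names (`_bucket`, `_sum`, `_count`) contain the base metric name as a substring, a plain string replacement covers them as well.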
✨ New Features
- Added support for Kimi-K2.5, Molmo2, Step3vl 10B, Step1, GLM-Lite, and Eagle2.5-8B VLM model architectures.
- Added LoRA support for Nemotron-H, InternVL2, and MiniMax M2.
- Enabled speculative decoding for EAGLE3 (Pixtral/LlavaForConditionalGeneration), Qwen3 VL MoE, and added general draft model support.
- Introduced BGE-M3 sparse embeddings and ColBERT embeddings support.
- Implemented Voxtral streaming architecture enhancements.
- Added SharedFusedMoE support for Qwen3MoE.
- Enabled dynamic resolution for Nemotron Nano VL.
- Enabled Molmo2 vision backbone quantization.
- Enabled `--async-scheduling` to work concurrently with pipeline parallelism.
- Implemented Mamba prefix caching using `--enable-prefix-caching --mamba-cache-mode align` for Mamba/hybrid models, yielding ~2x speedup.
- Introduced session-based streaming input, accepting async generators producing `StreamingInput` objects for interactive workloads like ASR.
- Model Runner V2 now supports VLM.
- Implemented inplace loading for LoRA for improved memory efficiency.
- Added support for torch.compile inductor artifacts via AOT compilation.
- FlashInfer MLA is now the default MLA backend on Blackwell GPUs, with TRTLLM as the default prefill backend.
- Implemented grouped topk kernel fusion for MoE, resulting in 1.2-2% E2E throughput improvement.
- Improved NVFP4 small-batch decoding performance.
- Enabled faster cold start for MoEs when using torch.compile.
- Optimized FP4 quantization on Blackwell (SM100F) using 256-bit loads, leading to up to 65% speedup and ~4% E2E throughput improvement.
- Added topk_sigmoid kernel for MoE routing.
- Added atomics reduce counting for SplitK skinny GEMMs.
- Fused cat+quant operation for FP8 KV cache in MLA.
- Enabled SiluAndMul and QuantFP8 CustomOp compilation via torch.compile.
- Improved Triton prefill attention performance via torch.compile.
- Added MoRI EP (Expert Parallel all2all backend) for AMD ROCm.
- Improved AMD ROCm attention via Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend.
- Enabled FP4 MLA projection GEMMs with dynamic quantization on AMD.
- Enabled Flash Attention Triton backend on AMD RDNA3/RDNA4 consumer GPUs.
- Added pipeline parallelism support for TPUs.
- Added backend option for TPU execution.
- Implemented AgRsAll2AllManager for distributed communication on Intel XPU.
- Enabled NUMA-aware acceleration for TP/DP inference on ARM CPUs.
- Added torch.compile support for Whisper models.
- Fixed platform compatibility issue for Windows Subsystem for Linux (WSL).
- Enabled W4A16 support for compressed-tensors MoE models using MXFP4.
- Added quantization support (Marlin, NVFP4 CUTLASS, FP8, INT8, compressed-tensors) for non-gated MoE models.
- Integrated Quantization Toolkit for Intel platforms.
- Enabled per-tensor and per-attention-head FP8 KV cache quantization via llmcompressor.
- Responses API now supports partial message generation.
- Added `include_stop_str_in_output` tuning parameter to Responses API.
- Added `prompt_cache_key` support to Responses API.
- OpenAI API now supports `skip_special_tokens` configuration.
- Score endpoint now supports flexible input formats using `data_1`/`data_2` and `queries`/`documents`.
- Added new Render endpoints for prompt preprocessing.
- Whisper API now returns `avg_logprob` and `compression_ratio` in verbose_json segments.
- Added FIPS 140-3 compliant hash option for security.
- Added `--ssl-ciphers` CLI argument for security configuration.
- Implemented auto-detection of `api_server_count` based on `dp_size`.
- Enabled wheel variant auto-detection during installation.
- Allowed custom profiler URI schemes.
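The session-based streaming input feature accepts an async generator producing `StreamingInput` objects. The sketch below shows only the generator side of such a session; the `StreamingInput` dataclass here is a hypothetical stand-in for vLLM's actual type, and feeding the generator to an engine is omitted:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StreamingInput:
    # Hypothetical stand-in for vLLM's StreamingInput type; the real class
    # is provided by vLLM and may carry different fields.
    audio_chunk: bytes

async def input_session():
    """Async generator yielding incremental input, e.g. ASR audio chunks."""
    for chunk in (b"hel", b"lo"):
        await asyncio.sleep(0)  # stand-in for waiting on a real audio source
        yield StreamingInput(audio_chunk=chunk)

async def collect() -> bytes:
    # A consumer (in real use, the vLLM engine) drains the session as
    # chunks arrive, rather than waiting for the full input up front.
    return b"".join([si.audio_chunk async for si in input_session()])

print(asyncio.run(collect()))  # b'hello'
```

The key property for interactive workloads like ASR is that the generator can block between chunks, so the model can begin processing before the full input exists.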
🐛 Bug Fixes
- Fixed configuration issue in speculative decoding for Eagle draft_model_config.
- Fixed incompatible scale shapes when using DeepSeek-V3.1 with DeepGEMM.
- Fixed DP+MoE inference issue using CpuCommunicator.
- Fixed P/D inference issue when models are not using MoE with DP.
- Fixed possible deadlock issue in EPLB.
- Fixed UCX memory leak on NIXL by exporting `UCX_MEM_MMAP_HOOK_MODE=none`.
- Fixed byte fallback handling for structured output using outlines.
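The UCX workaround is applied by vLLM itself in this release; on affected older versions the same effect can be had manually by setting the variable before UCX is initialized. A minimal sketch (environment configuration only):

```python
import os

# vLLM v0.15.0 sets this automatically. On affected older versions, disable
# UCX's mmap hooks before the UCX/NIXL transport is initialized (i.e. before
# importing/starting vLLM) to avoid the memory leak.
os.environ["UCX_MEM_MMAP_HOOK_MODE"] = "none"
print(os.environ["UCX_MEM_MMAP_HOOK_MODE"])  # none
```

Setting it in the shell (`export UCX_MEM_MMAP_HOOK_MODE=none`) before launching the server is equivalent.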
⚡ Deprecations
- HQQ quantization method is deprecated.