
v0.18.0

📦 vllm

⚠️ 2 breaking · ✨ 29 features · 🐛 26 fixes · 🔧 24 symbols

Summary

v0.18.0 introduces major features like gRPC serving, GPU-less render serving, and significant improvements to KV cache offloading and Elastic Expert Parallelism. Ray is now an optional dependency, and numerous model-specific fixes and kernel optimizations have been integrated.

⚠️ Breaking Changes

  • Ray is no longer a default dependency. Users who rely on Ray for distributed execution must now install it explicitly.
  • Cascade attention is disabled by default. If you relied on its previous default behavior, you may need to explicitly enable it.

Migration Steps

  1. If you previously hit `CUBLAS_STATUS_INVALID_VALUE` errors on v0.17.0, reinstall `torch==2.10.0`, as PyTorch has published a fix.
  2. If you rely on Ray for distributed execution, install it explicitly (e.g., `pip install ray`).
  3. If you relied on cascade attention being enabled by default, enable it explicitly.
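The migration steps above can be sketched as a small preflight check. Note that `migration_advice` and its version strings are illustrative helpers invented for this sketch, not part of vLLM or PyTorch:

```python
def migration_advice(installed: dict) -> list:
    """Toy preflight check for the v0.18.0 migration (illustrative only).

    `installed` maps package names to version strings, e.g. {"torch": "2.10.0"}.
    """
    advice = []
    # Step 1: the CUBLAS_STATUS_INVALID_VALUE fix ships in a republished torch 2.10.0.
    if installed.get("torch") == "2.10.0":
        advice.append("reinstall torch 2.10.0 to pick up the PyTorch cuBLAS fix")
    # Step 2: Ray is no longer a default dependency of vLLM.
    if "ray" not in installed:
        advice.append("run `pip install ray` if you use Ray for distributed execution")
    # Step 3: cascade attention is now disabled by default.
    advice.append("enable cascade attention explicitly if you relied on the old default")
    return advice

print(migration_advice({"torch": "2.10.0"}))
```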

✨ New Features

  • Added support for gRPC serving via the new `--grpc` flag, providing a high-performance RPC interface.
  • Introduced the `vllm launch render` command for GPU-less preprocessing and rendering, enabling separation of multimodal preprocessing from GPU inference.
  • NGram speculative decoding now runs on GPU and is compatible with the async scheduler, reducing spec decode overhead.
  • Improved KV cache offloading: smart CPU offloading now stores only frequently reused blocks, FlexKV is available as a new offloading backend, and the offloading spec supports multiple KV groups.
  • Elastic Expert Parallelism milestone 2 integrates NIXL-EP, enabling dynamic GPU scaling for MoE experts; a new `--enable-ep-weight-filter` CLI option speeds up EP model loading.
  • Updated FlashInfer dependency to version 0.6.6, bringing performance and correctness improvements.
  • OpenAI Responses API now supports tool/function calling with streaming.
  • Added beam search support for encoder-decoder models for both offline and online transcription (ASR).
  • Added support for new model architectures including Sarvam MoE, OLMo Hybrid, HyperCLOVAX-SEED-Think-32B/14B, Kimi-Audio-7B-Instruct, ColPali late-interaction retrieval, and ERNIE pooling models.
  • Added support for speculative decoding with Eagle3 for Qwen3.5 and Kimi K2.5 MLA, and Eagle for Mistral Large 3 with dense layers.
  • Added LoRA support for Whisper and FP8 LoRA dense kernel.
  • Enhanced multimodal support: online `use_audio_in_video`, audio extraction from MP4 for Nemotron Nano VL, audio transcription for MP4/M4A/WebM files, runtime exposure of `media_io_kwargs`, and fast media preprocessing for Nano Nemotron VL.
  • Model Runner V2 enhancements: probabilistic rejection sampling for spec decode, pooling model support, extensible CUDA graph dispatch, `WhisperModelState`, XD-RoPE, and `model_state` CUDA graph capture.
  • Implemented FA4 for MLA prefill.
  • Added FlashInfer Sparse MLA support for FP8 KV cache and CUDA graphs on ROCm.
  • Introduced a TRTLLM FP8 MoE modular kernel.
  • Added FP8 KV cache support for Triton MLA decode.
  • Added FlashInfer MoE A2A kernel.
  • Removed chunking from `FusedMoE` to enable full-batch processing.
  • Added a `FusedRMSNormGated` CustomOp for `torch.compile` compatibility.
  • Added a Mamba2 SSD prefill Triton kernel optimization.
  • DeepSeek-V3.2: added a vectorized MLA query-concat kernel and an optimized FP8 KV cache gather for context parallel.
  • Added support for 320-dimension MLA head size.
  • Implemented a packed recurrent fast path for decode.
  • Significant hardware and performance improvements across NVIDIA, AMD ROCm, Intel XPU, CPU (via OneDNN and zentorch), and RISC-V backends.
  • Added ModelOpt MXFP8 MoE support.
  • Implemented an MXFP4 MoE routing-simulation override for accuracy.
  • Added a fault-tolerance mechanism to LMCache.
  • Added support for skipping non-local expert weights during EP loading.
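Among the features above, NGram speculative decoding is the most algorithmic: the draft model is just a pattern match over the sequence so far. The core idea can be sketched in a few lines of pure Python; vLLM's actual implementation runs on GPU tensors, and `ngram_draft` with its `n`/`k` parameters is an illustrative toy, not vLLM's API:

```python
def ngram_draft(tokens: list, n: int = 2, k: int = 4) -> list:
    """Propose up to k draft tokens by matching the last n tokens
    against an earlier occurrence in the sequence (toy sketch)."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the suffix.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            # Propose the tokens that followed that occurrence as the draft.
            return tokens[start + n:start + n + k]
    return []

# The sequence "1 2 3 4 1 2" ends with (1, 2), which earlier continued with 3 4.
print(ngram_draft([1, 2, 3, 4, 1, 2], n=2, k=2))  # → [3, 4]
```

The target model then verifies the drafted tokens in a single forward pass, accepting the matching prefix; keeping the matching on GPU and compatible with the async scheduler is what reduces the spec-decode overhead noted above.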

🐛 Bug Fixes

  • Fixed sporadic stall issues by removing `pin_memory`.
  • Fixed VLM concurrent throughput degradation.
  • Fixed DP deadlock.
  • Fixed DeepSeek V3.2 OOM during CG profiling.
  • Fixed Ray DP startup crash.
  • Fixed NCCL rank calculation.
  • Fixed NaN outputs by zero-initializing MLA output buffers.
  • Fixed CUDA OOM issues.
  • Fixed async scheduling issue related to KV cache offloading.
  • Fixed EP scatter race condition.
  • Fixed LMCache memory leak and race condition.
  • Fixed TP size for MLA multi-reader locking in LMCache.
  • Fixed MLA crash with AWQ/GPTQ quantized models.
  • Fixed score layer quantization for reranker models.
  • Fixed DeepSeek-V3.2 tokenizer space stripping.
  • Fixed Qwen3.5 tool calling issues.
  • Fixed Qwen3-VL timestamp mismatch.
  • Fixed Qwen3-Next TP>1 weight sharding.
  • Fixed Qwen3-ASR `torch.compile` issues.
  • Fixed MiniCPM-V audio inference.
  • Fixed MiniCPM-O 4.5 ViT attention.
  • Fixed routed experts for hybrid models.
  • Fixed Qwen2.5-Omni/Qwen3-Omni multi-video audio_in_video issues.
  • Fixed DeepSeek-OCR empty images crash.
  • Fixed KV transfer issue with spec decode in PD Disaggregation.
  • Fixed compressed-tensors issue for DeepSeek-R1 on MI300x for ROCm.

Affected Symbols