
v0.18.0

📦 vllm

⚠️ 2 breaking · ✨ 29 features · 🐛 26 fixes · 🔧 24 symbols

Summary

v0.18.0 introduces major features like gRPC serving, GPU-less render serving, and significant improvements to KV cache offloading and Elastic Expert Parallelism. Ray is now an optional dependency, and numerous model-specific fixes and kernel optimizations have been integrated.

⚠️ Breaking Changes

  • Ray is no longer a default dependency. Users who rely on Ray for distributed execution must now install it explicitly.
  • Cascade attention is disabled by default. If you relied on its previous default behavior, you may need to explicitly enable it.

Migration Steps

  1. If you previously hit `CUBLAS_STATUS_INVALID_VALUE` errors on v0.17.0, reinstall `torch==2.10.0`, as PyTorch has published a fix.
  2. If you rely on Ray for distributed execution, install it explicitly (e.g., `pip install ray`).
  3. If you relied on cascade attention being enabled by default, enable it explicitly.
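The migration steps above can be sketched as a small preflight check. Note that `migration_advice` and its version strings are illustrative helpers invented for this sketch, not part of vLLM or PyTorch:

```python
def migration_advice(installed: dict) -> list:
    """Toy preflight check for the v0.18.0 migration (illustrative only).

    `installed` maps package names to version strings, e.g. {"torch": "2.10.0"}.
    """
    advice = []
    # Step 1: the CUBLAS_STATUS_INVALID_VALUE fix ships in a republished torch 2.10.0.
    if installed.get("torch") == "2.10.0":
        advice.append("reinstall torch 2.10.0 to pick up the PyTorch cuBLAS fix")
    # Step 2: Ray is no longer a default dependency of vLLM.
    if "ray" not in installed:
        advice.append("run `pip install ray` if you use Ray for distributed execution")
    # Step 3: cascade attention is now disabled by default.
    advice.append("enable cascade attention explicitly if you relied on the old default")
    return advice

print(migration_advice({"torch": "2.10.0"}))
```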

✨ New Features

  • Added support for gRPC serving via the new `--grpc` flag, providing a high-performance RPC interface.
  • Introduced the `vllm launch render` command for GPU-less preprocessing and rendering, enabling separation of multimodal preprocessing from GPU inference.
  • NGram speculative decoding now runs on GPU and is compatible with the async scheduler, reducing spec decode overhead.
  • Improved KV cache offloading: smart CPU offloading now stores only frequently reused blocks, FlexKV is available as a new offloading backend, and the offloading spec supports multiple KV groups.
  • Elastic Expert Parallelism milestone 2 integrates NIXL-EP, enabling dynamic GPU scaling for MoE experts; a new `--enable-ep-weight-filter` CLI option speeds up EP model loading.
  • Updated FlashInfer dependency to version 0.6.6, bringing performance and correctness improvements.
  • OpenAI Responses API now supports tool/function calling with streaming.
  • Added beam search support for encoder-decoder models for both offline and online transcription (ASR).
  • Added support for new model architectures including Sarvam MoE, OLMo Hybrid, HyperCLOVAX-SEED-Think-32B/14B, Kimi-Audio-7B-Instruct, ColPali late-interaction retrieval, and ERNIE pooling models.
  • Added support for speculative decoding with Eagle3 for Qwen3.5 and Kimi K2.5 MLA, and Eagle for Mistral Large 3 with dense layers.
  • Added LoRA support for Whisper and FP8 LoRA dense kernel.
  • Enhanced multimodal support: online `use_audio_in_video`, audio extraction from MP4 for Nemotron Nano VL, audio transcription for MP4/M4A/WebM files, runtime exposure of `media_io_kwargs`, and fast media preprocessing for Nano Nemotron VL.
  • Model Runner V2 enhancements: probabilistic rejection sampling for spec decode, pooling model support, extensible CUDA graph dispatch, `WhisperModelState`, XD-RoPE, and `model_state` CUDA graph capture.
  • Implemented FA4 for MLA prefill.
  • Added FlashInfer Sparse MLA support for FP8 KV cache and CUDA graphs on ROCm.
  • Introduced a TRTLLM FP8 MoE modular kernel.
  • Added FP8 KV cache support for Triton MLA decode.
  • Added FlashInfer MoE A2A kernel.
  • Removed chunking from `FusedMoE` to enable full-batch processing.
  • Added a `FusedRMSNormGated` CustomOp for `torch.compile` compatibility.
  • Added a Mamba2 SSD prefill Triton kernel optimization.
  • DeepSeek-V3.2: added a vectorized MLA query-concat kernel and an optimized FP8 KV cache gather for context parallel.
  • Added support for 320-dimension MLA head size.
  • Implemented a packed recurrent fast path for decode.
  • Significant hardware and performance improvements across NVIDIA, AMD ROCm, Intel XPU, CPU (via OneDNN and zentorch), and RISC-V backends.
  • Added ModelOpt MXFP8 MoE support.
  • Implemented an MXFP4 MoE routing-simulation override for accuracy.
  • Added a fault-tolerance mechanism to LMCache.
  • Added support for skipping non-local expert weights during EP loading.
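Among the features above, NGram speculative decoding is the most algorithmic: the draft model is just a pattern match over the sequence so far. The core idea can be sketched in a few lines of pure Python; vLLM's actual implementation runs on GPU tensors, and `ngram_draft` with its `n`/`k` parameters is an illustrative toy, not vLLM's API:

```python
def ngram_draft(tokens: list, n: int = 2, k: int = 4) -> list:
    """Propose up to k draft tokens by matching the last n tokens
    against an earlier occurrence in the sequence (toy sketch)."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the suffix.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            # Propose the tokens that followed that occurrence as the draft.
            return tokens[start + n:start + n + k]
    return []

# The sequence "1 2 3 4 1 2" ends with (1, 2), which earlier continued with 3 4.
print(ngram_draft([1, 2, 3, 4, 1, 2], n=2, k=2))  # → [3, 4]
```

The target model then verifies the drafted tokens in a single forward pass, accepting the matching prefix; keeping the matching on GPU and compatible with the async scheduler is what reduces the spec-decode overhead noted above.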

🐛 Bug Fixes

  • Fixed sporadic stall issues by removing `pin_memory`.
  • Fixed VLM concurrent throughput degradation.
  • Fixed DP deadlock.
  • Fixed DeepSeek V3.2 OOM during CG profiling.
  • Fixed Ray DP startup crash.
  • Fixed NCCL rank calculation.
  • Fixed NaN outputs by zero-initializing MLA output buffers.
  • Fixed CUDA OOM issues.
  • Fixed async scheduling issue related to KV cache offloading.
  • Fixed EP scatter race condition.
  • Fixed LMCache memory leak and race condition.
  • Fixed TP size for MLA multi-reader locking in LMCache.
  • Fixed MLA crash with AWQ/GPTQ quantized models.
  • Fixed score layer quantization for reranker models.
  • Fixed DeepSeek-V3.2 tokenizer space stripping.
  • Fixed Qwen3.5 tool calling issues.
  • Fixed Qwen3-VL timestamp mismatch.
  • Fixed Qwen3-Next TP>1 weight sharding.
  • Fixed Qwen3-ASR `torch.compile` issues.
  • Fixed MiniCPM-V audio inference.
  • Fixed MiniCPM-O 4.5 ViT attention.
  • Fixed routed experts for hybrid models.
  • Fixed Qwen2.5-Omni/Qwen3-Omni multi-video audio_in_video issues.
  • Fixed DeepSeek-OCR empty images crash.
  • Fixed KV transfer issue with spec decode in PD Disaggregation.
  • Fixed compressed-tensors issue for DeepSeek-R1 on MI300x for ROCm.

Affected Symbols