v0.18.0
⚠ 2 breaking · ✨ 29 features · 🐛 26 fixes · 🔧 24 symbols
Summary
v0.18.0 introduces major features like gRPC serving, GPU-less render serving, and significant improvements to KV cache offloading and Elastic Expert Parallelism. Ray is now an optional dependency, and numerous model-specific fixes and kernel optimizations have been integrated.
⚠️ Breaking Changes
- Ray is no longer a default dependency. Users who rely on Ray for distributed execution must now install it explicitly.
- Cascade attention is disabled by default. If you relied on its previous default behavior, you may need to explicitly enable it.
Migration Steps
- If you previously encountered `CUBLAS_STATUS_INVALID_VALUE` in v0.17.0, reinstall `torch 2.10.0`, as PyTorch published a fix.
- If you rely on Ray for distributed execution, install it explicitly (e.g., `pip install ray`); a quick import check is sketched after these steps.
- If you relied on cascade attention being enabled by default, re-enable it explicitly in your engine configuration.
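Before engine startup, a fail-fast check can catch the newly optional Ray dependency. A minimal sketch, assuming you select the Ray executor via vLLM's `--distributed-executor-backend ray` option:

```python
# Minimal sketch: fail fast if Ray is missing. Ray is no longer installed
# with vLLM by default as of v0.18.0.
try:
    import ray  # noqa: F401
except ImportError as exc:
    raise SystemExit(
        "Ray is now an optional dependency; run `pip install ray` before "
        "launching with --distributed-executor-backend ray"
    ) from exc

print(f"Ray {ray.__version__} available")
```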
✨ New Features
- Added support for gRPC serving via the new `--grpc` flag, providing a high-performance RPC interface (a hedged client sketch follows this list).
- Introduced the `vllm launch render` command for GPU-less preprocessing and rendering, enabling separation of multimodal preprocessing from GPU inference.
- NGram speculative decoding now runs on the GPU and is compatible with the async scheduler, reducing spec decode overhead (see the configuration sketch after this list).
- Improved KV cache offloading: smart CPU offloading now stores only frequently reused blocks, FlexKV is available as a new offloading backend, and the offloading spec supports multiple KV groups.
- Elastic Expert Parallelism Milestone 2 integrated NIXL-EP, enabling dynamic GPU scaling for MoE experts, with a new `--enable-ep-weight-filter` CLI option for faster EP model loading.
- Updated FlashInfer dependency to version 0.6.6, bringing performance and correctness improvements.
- OpenAI Responses API now supports tool/function calling with streaming (a streaming client sketch follows this list).
- Added beam search support for encoder/decoder models, covering both offline and online transcription (ASR).
- Added support for new model architectures including Sarvam MoE, OLMo Hybrid, HyperCLOVAX-SEED-Think-32B/14B, Kimi-Audio-7B-Instruct, ColPali late-interaction retrieval, and ERNIE pooling models.
- Added support for speculative decoding with Eagle3 for Qwen3.5 and Kimi K2.5 MLA, and Eagle for Mistral Large 3 with dense layers.
- Added LoRA support for Whisper and an FP8 LoRA dense kernel.
- Enhanced multimodal support: online `use_audio_in_video`, audio extraction from MP4 for Nemotron Nano VL, audio transcription for MP4/M4A/WebM, runtime exposure of `media_io_kwargs`, and fast media preprocessing for Nano Nemotron VL.
- Model Runner V2 enhancements include probabilistic rejection sampling for spec decode, pooling model support, extensible CUDA graph dispatch, `WhisperModelState`, XD-RoPE, and `model_state` CUDA graph capture.
- Implemented FA4 for MLA prefill.
- Added FlashInfer Sparse MLA support for FP8 KV cache and CUDA graphs on ROCm.
- Introduced a TRTLLM FP8 MoE modular kernel.
- Added FP8 KV cache support for Triton MLA decode.
- Added FlashInfer MoE A2A kernel.
- Removed chunking from `FusedMoE` for full-batch processing.
- Added `FusedRMSNormGated` as a `CustomOp` for `torch.compile` compatibility.
- Optimized the Mamba2 SSD prefill Triton kernel.
- DeepSeek-V3.2 received a vectorized MLA query-concat kernel and an optimized FP8 KV cache gather for context parallel.
- Added support for 320-dimension MLA head size.
- Implemented a packed recurrent fast path for decode.
- Significant hardware and performance improvements across NVIDIA, AMD ROCm, Intel XPU, CPU (via OneDNN and zentorch), and RISC-V backends.
- Added ModelOpt MXFP8 MoE support.
- Implemented an MXFP4 MoE routing-simulation override for accuracy.
- Added a fault-tolerance mechanism to LMCache.
- Added support for skipping non-local expert weights during EP loading.
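For the new gRPC serving mode, the notes add the `--grpc` flag (e.g., `vllm serve <model> --grpc`) but do not reproduce the service definition here. The sketch below shows only standard `grpcio` channel setup; the port is an assumption, and the commented-out stub names are hypothetical placeholders for whatever vLLM's published proto generates.

```python
import grpc

# Hedged sketch: connect to a server launched with `vllm serve <model> --grpc`.
# The address and port are assumptions; check your server's startup logs.
channel = grpc.insecure_channel("localhost:50051")
grpc.channel_ready_future(channel).result(timeout=10)  # wait for the server

# The stub below is HYPOTHETICAL; substitute the classes generated from
# vLLM's actual .proto definition:
# stub = vllm_pb2_grpc.GenerationStub(channel)
# for chunk in stub.Generate(vllm_pb2.GenerateRequest(prompt="Hello")):
#     print(chunk)
```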
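Because ngram speculation is now GPU-resident and works with the async scheduler, enabling it is a configuration change. A minimal offline sketch, assuming vLLM's documented `speculative_config` keys; the model name is a placeholder, not taken from these notes:

```python
from vllm import LLM, SamplingParams

# Hedged sketch: enable ngram speculative decoding via speculative_config.
# Verify the exact keys against the v0.18.0 docs; the model is a placeholder.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 4,  # draft tokens proposed per step
        "prompt_lookup_max": 4,       # longest ngram matched against the prompt
    },
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```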
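For tool calling with streaming on the Responses API, here is a hedged client sketch using the standard `openai` client pointed at a local vLLM server; the base URL, model name, and `get_weather` tool schema are illustrative assumptions:

```python
from openai import OpenAI

# Hedged sketch: stream a Responses API request that exposes a function tool.
# Base URL, model name, and the get_weather tool are illustrative only.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.responses.create(
    model="Qwen/Qwen3-8B",  # placeholder for any served tool-capable model
    input="What's the weather in Paris right now?",
    tools=[{
        "type": "function",
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    stream=True,
)
for event in stream:  # tool-call arguments arrive as incremental delta events
    print(event.type)
```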
🐛 Bug Fixes
- Fixed sporadic stall issues by removing `pin_memory`.
- Fixed VLM concurrent throughput degradation.
- Fixed DP deadlock.
- Fixed DeepSeek V3.2 OOM during CG profiling.
- Fixed Ray DP startup crash.
- Fixed NCCL rank calculation.
- Fixed zero-init MLA output buffers to prevent NaN.
- Fixed CUDA OOM issues.
- Fixed async scheduling issue related to KV cache offloading.
- Fixed EP scatter race condition.
- Fixed LMCache memory leak and race condition.
- Fixed TP size for MLA multi-reader locking in LMCache.
- Fixed MLA crash with AWQ/GPTQ quantized models.
- Fixed score layer quantization for reranker models.
- Fixed DeepSeek-V3.2 tokenizer space stripping.
- Fixed Qwen3.5 tool calling issues.
- Fixed Qwen3-VL timestamp mismatch.
- Fixed Qwen3-Next TP>1 weight sharding.
- Fixed Qwen3-ASR `torch.compile` issues.
- Fixed MiniCPM-V audio inference.
- Fixed MiniCPM-O 4.5 ViT attention.
- Fixed routed experts for hybrid models.
- Fixed Qwen2.5-Omni/Qwen3-Omni multi-video `audio_in_video` issues.
- Fixed DeepSeek-OCR empty images crash.
- Fixed KV transfer issue with spec decode in PD Disaggregation.
- Fixed compressed-tensors issue for DeepSeek-R1 on MI300x for ROCm.