v0.9.2
📦 vllm · Breaking Changes
⚠ 3 breaking · ✨ 10 features · 🐛 6 fixes · ⚡ 2 deprecations · 🔧 9 symbols
Summary
This release marks the final transition phase to the V1 engine, introducing Blackwell (SM100/120) support, Expert-Parallel Load Balancing, and expanded multi-modal/audio API capabilities. It includes significant performance optimizations for CUDA-Graphs and broadens hardware support for Intel GPUs and TPUs.
⚠️ Breaking Changes
- FP4 emulation has been removed on devices with compute capability less than SM100 (Blackwell).
- Runtime pickle/cloudpickle imports are now forbidden as a security hardening measure.
- This is the final version where V0 engine code and features remain intact; future versions will prioritize V1.
Migration Steps
- Migrate from V0 engine to V1 engine as V0 will be removed/changed in the next version.
- Update monitoring systems to use new metric names as 'gpu_' prefixes are deprecated for general metrics.
- Ensure hardware compatibility for FP4 features (requires SM100+).
- Review custom code for any reliance on cloudpickle imports, which are now restricted.
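The FP4 hardware check above can be expressed as a simple capability gate. This is a sketch, not vLLM's internal check; the function name and the exact cutoff expression are illustrative, based on the stated SM100 (compute capability 10.0) requirement.

```python
# Sketch (not vLLM's internal check): gate FP4 code paths on compute
# capability, since FP4 emulation was removed for devices below SM100.
def supports_fp4(major: int, minor: int) -> bool:
    """True when the device's compute capability is at least SM100 (10.0)."""
    return (major, minor) >= (10, 0)

print(supports_fp4(9, 0))   # Hopper SM90 -> False
print(supports_fp4(10, 0))  # Blackwell SM100 -> True
```

A guard like this lets deployment scripts fail fast with a clear message instead of hitting a missing-kernel error at load time.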
✨ New Features
- Priority Scheduling, embedding models, and Mamba2 support implemented in V1 engine.
- Full CUDA-Graph execution for FlashAttention v3 and FlashMLA, including prefix-caching and a live capture progress bar.
- Expert-Parallel Load Balancer (EPLB) for large-scale serving.
- Support for NVIDIA Blackwell (SM120/SM100) with CUTLASS W8A8/FP8 kernels and deep-GEMM.
- Intel GPU (V1) backend with Flash-Attention support.
- Calibration-free RTN INT4/INT8 quantization pipeline.
- New OpenAI-compatible endpoints: /v1/audio/translations and revamped /v1/audio/transcriptions.
- Support for new model families: Ernie 4.5, MiniMax-M1, Slim-MoE, Tencent HunYuan-MoE-V1, Keye-VL-8B-Preview, GLM-4.1V, Gemma-3 (text), Tarsier 2, Qwen3 Embedding & Reranker.
- Native xPyD P2P NCCL transport for disaggregated serving.
- Non-privileged CPU mode for Docker and Kubernetes deployments.
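To make the Expert-Parallel Load Balancer (EPLB) feature concrete, here is a toy illustration of the underlying idea: distributing experts across ranks so that per-rank token load stays even. This is not vLLM's implementation; the greedy longest-processing-time heuristic and all names here are illustrative.

```python
# Toy expert-parallel load balancing (illustrative only, not vLLM's EPLB):
# assign the heaviest experts first, each to the currently lightest rank.
def balance_experts(expert_loads: dict, num_ranks: int):
    """Greedy LPT assignment of experts to ranks; returns (placement, loads)."""
    ranks = [[] for _ in range(num_ranks)]
    rank_load = [0] * num_ranks
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        r = rank_load.index(min(rank_load))  # pick the least-loaded rank
        ranks[r].append(expert)
        rank_load[r] += load
    return ranks, rank_load

ranks, loads = balance_experts({"e0": 90, "e1": 60, "e2": 50, "e3": 40}, 2)
print(ranks, loads)
```

The real EPLB additionally has to rebalance online as routing distributions drift, but the objective (minimizing the maximum per-rank load) is the same.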
🐛 Bug Fixes
- Fixed block stranding in disaggregated serving when requests are aborted in the waiting queue.
- Fixed KV-padding and head-dim issues on TPU.
- Fixed Docker build errors for the cpu-dev image.
- Fixed misleading ROCm warnings and environment override warning noise.
- NaNs in logits are now detected and exported to scheduler_stats when output is corrupted, instead of failing silently.
- Fixed use_cudagraph compatibility with dynamic VLLM_USE_V1 flag.
🔧 Affected Symbols
- LLM.beam_search
- llm.chat
- FusedMoEModularKernel
- AutoWeightsLoader
- MultiModalHasher.hash_prompt_mm_data
- CachedRequestData
- FlashAttention
- FlashMLA
- TritonAttention
⚡ Deprecations
- Metrics with the 'gpu_' prefix have been deprecated for non-GPU specific metrics.
- V0 engine is deprecated in favor of V1 engine.
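For the metric-name deprecation, dashboards and alerts referencing the old names need a one-time rename. A minimal migration helper, assuming only that the deprecated names differ from the new ones by the 'gpu_' prefix (which metrics count as "general" is deployment-specific, so apply it selectively):

```python
# Hypothetical helper: map a deprecated 'gpu_'-prefixed general metric name
# to its new un-prefixed form when updating dashboards and alert rules.
DEPRECATED_PREFIX = "gpu_"

def migrate_metric_name(name: str) -> str:
    """Strip the deprecated 'gpu_' prefix; leave other names untouched."""
    if name.startswith(DEPRECATED_PREFIX):
        return name[len(DEPRECATED_PREFIX):]
    return name

print(migrate_metric_name("gpu_cache_usage_perc"))  # cache_usage_perc
print(migrate_metric_name("num_requests_running"))  # num_requests_running
```

Note that genuinely GPU-specific metrics keep their prefix, so this rename should only be applied to metrics the release notes flag as general.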