v0.9.2
📦 vllm · Breaking Changes
⚠ 3 breaking · ✨ 10 features · 🐛 6 fixes · ⚡ 2 deprecations · 🔧 9 symbols
Summary
This release marks the final transition phase to the V1 engine, introducing Blackwell (SM100/120) support, Expert-Parallel Load Balancing, and expanded multi-modal/audio API capabilities. It includes significant performance optimizations for CUDA-Graphs and broadens hardware support for Intel GPUs and TPUs.
⚠️ Breaking Changes
- FP4 emulation has been removed on devices with compute capability less than SM100 (Blackwell).
- Runtime pickle/cloudpickle imports are now forbidden as a security hardening measure.
- This is the final version where V0 engine code and features remain intact; future versions will prioritize V1.
Migration Steps
- Migrate from V0 engine to V1 engine as V0 will be removed/changed in the next version.
- Update monitoring systems to use new metric names as 'gpu_' prefixes are deprecated for general metrics.
- Ensure hardware compatibility for FP4 features (requires SM100+).
- Review custom code for any reliance on cloudpickle imports, which are now restricted.
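The FP4 hardware check above can be expressed as a simple capability gate. This is a sketch, not vLLM's internal check; the function name and the exact cutoff expression are illustrative, based on the stated SM100 (compute capability 10.0) requirement.

```python
# Sketch (not vLLM's internal check): gate FP4 code paths on compute
# capability, since FP4 emulation was removed for devices below SM100.
def supports_fp4(major: int, minor: int) -> bool:
    """True when the device's compute capability is at least SM100 (10.0)."""
    return (major, minor) >= (10, 0)

print(supports_fp4(9, 0))   # Hopper SM90 -> False
print(supports_fp4(10, 0))  # Blackwell SM100 -> True
```

A guard like this lets deployment scripts fail fast with a clear message instead of hitting a missing-kernel error at load time.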
✨ New Features
- Priority Scheduling, embedding models, and Mamba2 support implemented in V1 engine.
- Full CUDA-Graph execution for FlashAttention v3 and FlashMLA, including prefix-caching and a live capture progress bar.
- Expert-Parallel Load Balancer (EPLB) for large-scale serving.
- Support for NVIDIA Blackwell (SM120/SM100) with CUTLASS W8A8/FP8 kernels and deep-GEMM.
- Intel GPU (V1) backend with Flash-Attention support.
- Calibration-free RTN INT4/INT8 quantization pipeline.
- New OpenAI-compatible endpoints: /v1/audio/translations and revamped /v1/audio/transcriptions.
- Support for new model families: Ernie 4.5, MiniMax-M1, Slim-MoE, Tencent HunYuan-MoE-V1, Keye-VL-8B-Preview, GLM-4.1V, Gemma-3 (text), Tarsier 2, Qwen3 Embedding & Reranker.
- Native xPyD P2P NCCL transport for disaggregated serving.
- Non-privileged CPU mode for Docker and Kubernetes deployments.
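To make the Expert-Parallel Load Balancer (EPLB) feature concrete, here is a toy illustration of the underlying idea: distributing experts across ranks so that per-rank token load stays even. This is not vLLM's implementation; the greedy longest-processing-time heuristic and all names here are illustrative.

```python
# Toy expert-parallel load balancing (illustrative only, not vLLM's EPLB):
# assign the heaviest experts first, each to the currently lightest rank.
def balance_experts(expert_loads: dict, num_ranks: int):
    """Greedy LPT assignment of experts to ranks; returns (placement, loads)."""
    ranks = [[] for _ in range(num_ranks)]
    rank_load = [0] * num_ranks
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        r = rank_load.index(min(rank_load))  # pick the least-loaded rank
        ranks[r].append(expert)
        rank_load[r] += load
    return ranks, rank_load

ranks, loads = balance_experts({"e0": 90, "e1": 60, "e2": 50, "e3": 40}, 2)
print(ranks, loads)
```

The real EPLB additionally has to rebalance online as routing distributions drift, but the objective (minimizing the maximum per-rank load) is the same.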
🐛 Bug Fixes
- Fixed block stranding in disaggregated serving when requests are aborted in the waiting queue.
- Fixed KV-padding and head-dim issues on TPU.
- Fixed Docker build errors for the cpu-dev image.
- Fixed misleading ROCm warnings and environment override warning noise.
- NaNs in logits are now detected and exported to scheduler_stats when output is corrupted, instead of failing silently.
- Fixed use_cudagraph compatibility with dynamic VLLM_USE_V1 flag.
🔧 Affected Symbols
- LLM.beam_search
- llm.chat
- FusedMoEModularKernel
- AutoWeightsLoader
- MultiModalHasher.hash_prompt_mm_data
- CachedRequestData
- FlashAttention
- FlashMLA
- TritonAttention
⚡ Deprecations
- Metrics with the 'gpu_' prefix have been deprecated for non-GPU specific metrics.
- V0 engine is deprecated in favor of V1 engine.
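For the metric-name deprecation, dashboards and alerts referencing the old names need a one-time rename. A minimal migration helper, assuming only that the deprecated names differ from the new ones by the 'gpu_' prefix (which metrics count as "general" is deployment-specific, so apply it selectively):

```python
# Hypothetical helper: map a deprecated 'gpu_'-prefixed general metric name
# to its new un-prefixed form when updating dashboards and alert rules.
DEPRECATED_PREFIX = "gpu_"

def migrate_metric_name(name: str) -> str:
    """Strip the deprecated 'gpu_' prefix; leave other names untouched."""
    if name.startswith(DEPRECATED_PREFIX):
        return name[len(DEPRECATED_PREFIX):]
    return name

print(migrate_metric_name("gpu_cache_usage_perc"))  # cache_usage_perc
print(migrate_metric_name("num_requests_running"))  # num_requests_running
```

Note that genuinely GPU-specific metrics keep their prefix, so this rename should only be applied to metrics the release notes flag as general.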