
v0.9.2

Breaking Changes
📦 vllm

⚠️ 3 breaking · ✨ 10 features · 🐛 6 fixes · ⚡ 2 deprecations · 🔧 9 symbols

Summary

This release marks the final transition phase to the V1 engine, introducing Blackwell (SM100/120) support, Expert-Parallel Load Balancing, and expanded multi-modal/audio API capabilities. It includes significant performance optimizations for CUDA-Graphs and broadens hardware support for Intel GPUs and TPUs.

⚠️ Breaking Changes

  • FP4 emulation has been removed on devices with compute capability less than SM100 (Blackwell).
  • Runtime (cloud)pickle imports are now forbidden for security hardening.
  • This is the final version where V0 engine code and features remain intact; future versions will prioritize V1.

Migration Steps

  1. Migrate from the V0 engine to the V1 engine; V0 will be removed or changed in the next version.
  2. Update monitoring systems to use new metric names as 'gpu_' prefixes are deprecated for general metrics.
  3. Ensure hardware compatibility for FP4 features (requires SM100+).
  4. Review custom code for any reliance on cloudpickle imports, which are now restricted.
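The hardware check in step 3 can be sketched as follows. Note that `supports_fp4` is a hypothetical helper for illustration, not part of vLLM's API; it only encodes the SM100+ requirement stated above.

```python
def supports_fp4(major: int, minor: int) -> bool:
    """FP4 kernels require Blackwell-class hardware:
    compute capability SM100 (10.0) or newer."""
    return (major, minor) >= (10, 0)


def current_device_supports_fp4() -> bool:
    # Query the active CUDA device's compute capability via PyTorch.
    import torch

    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return supports_fp4(major, minor)
```

Running this check at startup gives a clearer failure mode than hitting the removed FP4-emulation path at inference time.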

✨ New Features

  • Priority scheduling, embedding models, and Mamba2 support are now implemented in the V1 engine.
  • Full CUDA-Graph execution for FlashAttention v3 and FlashMLA, including prefix-caching and a live capture progress bar.
  • Expert-Parallel Load Balancer (EPLB) for large-scale serving.
  • Support for NVIDIA Blackwell (SM120/SM100) with CUTLASS W8A8/FP8 kernels and deep-GEMM.
  • Intel GPU (V1) backend with Flash-Attention support.
  • Calibration-free RTN INT4/INT8 quantization pipeline.
  • New OpenAI-compatible endpoints: /v1/audio/translations and revamped /v1/audio/transcriptions.
  • Support for new model families: Ernie 4.5, MiniMax-M1, Slim-MoE, Tencent HunYuan-MoE-V1, Keye-VL-8B-Preview, GLM-4.1V, Gemma-3 (text), Tarsier2, Qwen3 Embedding & Reranker.
  • Native xPyD P2P NCCL transport for disaggregated serving.
  • Non-privileged CPU mode for Docker and Kubernetes deployments.
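As a sketch of how the new audio endpoints might be addressed, the snippet below assembles a request for /v1/audio/translations against a locally running server. The base URL and model name are assumptions, and no request is actually sent here.

```python
def build_audio_request(task: str, model: str,
                        base_url: str = "http://localhost:8000"):
    """Assemble the URL and form fields for the OpenAI-compatible
    audio endpoints ('transcriptions' or 'translations')."""
    if task not in ("transcriptions", "translations"):
        raise ValueError(f"unknown audio task: {task}")
    url = f"{base_url}/v1/audio/{task}"
    # The endpoint expects a multipart form carrying the audio file
    # alongside the model name.
    form = {"model": model}
    return url, form


url, form = build_audio_request("translations", "openai/whisper-large-v3")
```

An actual call would then POST the form with the audio file attached, e.g. `requests.post(url, files={"file": open(path, "rb")}, data=form)`.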

🐛 Bug Fixes

  • Fixed block stranding in disaggregated serving when requests are aborted in the waiting queue.
  • Fixed KV-padding and head-dim issues on TPU.
  • Fixed errors when building the cpu-dev Docker image.
  • Fixed misleading ROCm warnings and environment override warning noise.
  • NaNs in logits are now detected and surfaced via scheduler_stats when output is corrupted.
  • Fixed use_cudagraph compatibility with dynamic VLLM_USE_V1 flag.

🔧 Affected Symbols

LLM.beam_search, llm.chat, FusedMoEModularKernel, AutoWeightsLoader, MultiModalHasher.hash_prompt_mm_data, CachedRequestData, FlashAttention, FlashMLA, TritonAttention

⚡ Deprecations

  • Metrics with the 'gpu_' prefix have been deprecated for non-GPU specific metrics.
  • V0 engine is deprecated in favor of V1 engine.
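For the metrics migration (step 2 above), a rename shim in monitoring glue code might look like the sketch below. The concrete replacement name shown is an assumption; check it against the metric names your vLLM version actually exports before relying on it.

```python
# Hypothetical mapping from deprecated 'gpu_'-prefixed metric names to their
# replacements. The right-hand names are assumptions for illustration and
# must be verified against the /metrics output of your deployment.
RENAMES = {
    "vllm:gpu_cache_usage_perc": "vllm:kv_cache_usage_perc",
}


def migrate_metric_name(name: str) -> str:
    """Return the non-deprecated name for a metric,
    leaving unknown names untouched."""
    return RENAMES.get(name, name)
```

Applying such a shim at the scraper or dashboard layer lets both old and new names resolve during the deprecation window.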