Changelog

v0.9.2rc1

Breaking Changes
📦 vllm
3 breaking · 9 features · 9 fixes · 1 deprecation · 7 symbols

Summary

This release introduces support for Qwen3 Embedding/Reranker models, enables ROCm V1 by default, and adds several performance optimizations including deep_gemm support and vectorized INT8 kernels. It also includes critical bug fixes for structured outputs and CUDAGraph stability.

⚠️ Breaking Changes

  • ROCm platforms now use V1 engine by default, which may change performance characteristics or feature availability.
  • Removed MultiModalHasher.hash_prompt_mm_data; code relying on this internal method will fail.
  • New security policy prevents new imports of (cloud)pickle to mitigate deserialization vulnerabilities.
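The pickle restriction above exists because unpickling untrusted bytes can execute arbitrary code during deserialization. As a minimal sketch (not vLLM code), a data-only format such as JSON avoids this class of vulnerability for untrusted payloads:

```python
import json


def safe_load(payload: bytes) -> dict:
    # json.loads only ever constructs plain data types (dict, list,
    # str, int, ...), so no attacker-controlled code runs during
    # deserialization -- unlike pickle, where opcodes can invoke
    # __reduce__ on arbitrary objects.
    return json.loads(payload.decode("utf-8"))


data = safe_load(b'{"model": "Qwen3", "top_k": 5}')
print(data["model"])
```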

Migration Steps

  1. Update FlashInfer to 0.2.6.post1 if using FlashInfer backend.
  2. Ensure inputs are contiguous when using dynamic_per_token FP8/INT8 quantization.
  3. If using ROCm, review the V1 User Guide as it is now the default engine.
  4. Replace any custom usage of MultiModalHasher.hash_prompt_mm_data with standard hashing logic.
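Steps 2 and 4 can be sketched as follows. This is an illustrative NumPy example, not vLLM code: the real quantization path takes torch tensors (where `tensor.contiguous()` plays the same role), and `hash_mm_item` is a hypothetical stand-in for whatever content hash replaces the removed internal helper.

```python
import hashlib

import numpy as np


def ensure_contiguous(x: np.ndarray) -> np.ndarray:
    # Step 2 analogue: quantization kernels expect contiguous memory.
    # A transposed or sliced view is often not C-contiguous, so copy
    # it into a contiguous buffer first.
    return x if x.flags["C_CONTIGUOUS"] else np.ascontiguousarray(x)


def hash_mm_item(item_bytes: bytes) -> str:
    # Step 4: a plain-hashlib replacement for code that previously
    # called MultiModalHasher.hash_prompt_mm_data. Any stable content
    # hash over the serialized multimodal item works here.
    return hashlib.sha256(item_bytes).hexdigest()


x = np.zeros((4, 8)).T           # transposed view: not C-contiguous
y = ensure_contiguous(x)
print(y.flags["C_CONTIGUOUS"])   # True
```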

✨ New Features

  • Support for Qwen3 Embedding & Reranker models.
  • Support for deep_gemm in linear methods for improved performance.
  • Added H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8.
  • Added Triton Fused MoE kernel config for E=16 on B200.
  • Support for non-string values in JSON keys via CLI.
  • Support for non-privileged mode on CPU for Docker and Kubernetes deployments.
  • Vectorized static/dynamic INT8 quantization kernels for better performance.
  • Added feedback during CUDAGraph capture for better user experience.
  • Added activation chunking logic to FusedMoEModularKernel.

🐛 Bug Fixes

  • Fixed use_cudagraph to work with dynamic VLLM_USE_V1.
  • Fixed docker build error for cpu-dev images.
  • Fixed incremental detokenization edge case error.
  • Fixed missing sep_token for Qwen3-Reranker in Score API.
  • Fixed Batched DeepGemm Experts.
  • Fixed EAGLE vocab embedding for multimodal target models.
  • Fixed Python 3.9 compatibility by removing the strict argument from zip(), which only exists in Python 3.10+.
  • Fixed TorchAOConfig skip layers logic.
  • Resolved failed concurrent structured output requests in V1 engine.
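On the zip() fix above: the strict=True keyword was added to zip() in Python 3.10, so calls using it raise a TypeError on 3.9. A minimal helper (illustrative, not vLLM's actual replacement) that preserves the length check on 3.9 looks like this:

```python
def zip_strict(a, b):
    # Backport of Python 3.10's zip(a, b, strict=True): pair up
    # elements, but raise if the two iterables differ in length
    # instead of silently truncating to the shorter one.
    a, b = list(a), list(b)
    if len(a) != len(b):
        raise ValueError("zip_strict: iterables have different lengths")
    return list(zip(a, b))


print(zip_strict([1, 2], ["x", "y"]))  # [(1, 'x'), (2, 'y')]
```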

🔧 Affected Symbols

  • MultiModalHasher.hash_prompt_mm_data
  • FusedMoEModularKernel
  • AutoWeightsLoader
  • TorchAOConfig
  • w8a8_block_fp8_matmul_deepgemm
  • Qwen3-Reranker
  • VLLM_USE_V1

⚡ Deprecations

  • Removed unused MultiModalHasher.hash_prompt_mm_data.