v0.9.2rc1
📦 vllm
⚠ 3 breaking · ✨ 9 features · 🐛 9 fixes · ⚡ 1 deprecation · 🔧 7 symbols
Summary
This release introduces support for Qwen3 Embedding/Reranker models, enables ROCm V1 by default, and adds several performance optimizations including deep_gemm support and vectorized INT8 kernels. It also includes critical bug fixes for structured outputs and CUDAGraph stability.
⚠️ Breaking Changes
- ROCm platforms now use the V1 engine by default, which may change performance characteristics or feature availability (see the engine-pinning sketch after this list).
- Removed MultiModalHasher.hash_prompt_mm_data; code relying on this internal method will fail.
- New security policy prevents new imports of (cloud)pickle to mitigate deserialization vulnerabilities.
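If the new ROCm default changes behavior for a workload, the engine can typically still be pinned explicitly via the `VLLM_USE_V1` environment variable (listed under affected symbols below). A minimal sketch, assuming the variable is still honored on your build; the model ID is illustrative:

```python
import os

# Select the engine explicitly before importing vLLM; the variable is
# read at startup. "1" opts in to V1, "0" keeps the legacy V0 engine.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # model ID is illustrative
```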
Migration Steps
- Update FlashInfer to 0.2.6.post1 if using FlashInfer backend.
- Ensure inputs are contiguous when using dynamic_per_token FP8/INT8 quantization (sketch after this list).
- If using ROCm, review the V1 User Guide as it is now the default engine.
- Replace any custom usage of MultiModalHasher.hash_prompt_mm_data with standard hashing logic.
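For the contiguity requirement, a minimal PyTorch sketch; the shapes are arbitrary, and the point is that views such as transposes are not contiguous:

```python
import torch

# Transposed views are a common source of non-contiguous tensors.
x = torch.randn(8, 4096, dtype=torch.float16).t()
assert not x.is_contiguous()

# The dynamic_per_token FP8/INT8 kernels expect row-major input, so
# materialize a contiguous copy before quantizing.
if not x.is_contiguous():
    x = x.contiguous()
assert x.is_contiguous()
```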
✨ New Features
- Support for Qwen3 Embedding & Reranker models (usage sketch after this list).
- Support for deep_gemm in linear methods for improved performance.
- Added H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8.
- Added Triton Fused MoE kernel config for E=16 on B200.
- Support for non-string values in JSON keys passed via the CLI.
- Support for non-privileged mode on CPU for Docker and Kubernetes deployments.
- Vectorized static/dynamic INT8 quantization kernels for better performance.
- Added feedback during CUDAGraph capture for better user experience.
- Added activation chunking logic to FusedMoEModularKernel.
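A minimal offline-inference sketch for the new Qwen3 Embedding/Reranker support, assuming the standard `embed`/`score` task interfaces apply to these models. The model IDs are illustrative, and the `hf_overrides` values follow vLLM's Qwen3 reranker example at the time; verify them against current docs:

```python
from vllm import LLM

# Embedding: the "embed" task pools one vector per prompt.
embedder = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
out = embedder.embed(["What is the capital of France?"])
print(len(out[0].outputs.embedding))  # embedding dimensionality

# Reranking: the "score" task scores (query, document) pairs. The original
# Qwen3-Reranker checkpoints may need hf_overrides to load as a sequence
# classifier; check vLLM's Qwen3 reranker example for the exact values.
reranker = LLM(
    model="Qwen/Qwen3-Reranker-0.6B",
    task="score",
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
)
scores = reranker.score(
    "capital of France",
    ["Paris is the capital of France.", "Berlin is in Germany."],
)
print([s.outputs.score for s in scores])
```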
🐛 Bug Fixes
- Fixed use_cudagraph to work with dynamic VLLM_USE_V1.
- Fixed docker build error for cpu-dev images.
- Fixed incremental detokenization edge case error.
- Fixed missing sep_token for Qwen3-Reranker in Score API.
- Fixed Batched DeepGemm Experts.
- Fixed EAGLE vocab embedding for multimodal target models.
- Fixed Python 3.9 compatibility by removing the 'strict' argument from zip, which requires Python 3.10+ (see the backport sketch after this list).
- Fixed TorchAOConfig skip layers logic.
- Resolved failed concurrent structured output requests in V1 engine.
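For context on the zip fix: `zip(..., strict=True)` only exists on Python 3.10+, so 3.9-compatible code needs an explicit length check instead. A hypothetical backport (`strict_zip` is an illustrative helper, not vLLM API):

```python
from itertools import zip_longest

_SENTINEL = object()

def strict_zip(*iterables):
    """zip() that raises on unequal lengths; works on Python 3.9,
    unlike zip(..., strict=True), which needs 3.10+."""
    for combo in zip_longest(*iterables, fillvalue=_SENTINEL):
        if any(item is _SENTINEL for item in combo):
            raise ValueError("strict_zip() arguments have unequal lengths")
        yield combo

assert list(strict_zip([1, 2], ["a", "b"])) == [(1, "a"), (2, "b")]
```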
🔧 Affected Symbols
- MultiModalHasher.hash_prompt_mm_data
- FusedMoEModularKernel
- AutoWeightsLoader
- TorchAOConfig
- w8a8_block_fp8_matmul_deepgemm
- Qwen3-Reranker
- VLLM_USE_V1
⚡ Deprecations
- Removed unused MultiModalHasher.hash_prompt_mm_data.
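For callers of the removed hasher, plain stdlib hashing is usually enough as the "standard hashing logic" the migration step refers to. A hypothetical sketch (`hash_mm_item` is an illustrative name, not a vLLM API):

```python
import hashlib

def hash_mm_item(payload: bytes) -> str:
    # Stable content digest over the raw bytes of a multimodal item.
    return hashlib.sha256(payload).hexdigest()

print(hash_mm_item(b"raw image bytes ..."))  # placeholder payload
```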