v0.9.2rc2
📦 vllm
⚠ 1 breaking · ✨ 10 features · 🐛 10 fixes · ⚡ 1 deprecation · 🔧 8 symbols
Summary
This release focuses on the transition to the V1 engine by removing legacy V0 backends for CPU/TPU/XPU, while adding support for Blackwell (SM100) and Llama 4. Key improvements include FP8 kernel optimizations, FlexAttention enhancements, and expanded multimodal support in the frontend.
⚠️ Breaking Changes
- Removed V0 backends for CPU, XPU, and TPU. Users must transition to the V1 engine for these hardware platforms.
Migration Steps
- Switch to the V1 engine if you use the CPU, XPU, or TPU backend; the V0 versions have been removed.
- Update CI/build pipelines to include the new kvcache-connector dependency.
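For deployments that still select the engine explicitly, opting in to V1 is a matter of setting vLLM's `VLLM_USE_V1` environment variable before launching the server. A minimal sketch (the model name in the commented-out serve command is illustrative, not part of this release):

```shell
# Explicitly opt in to the V1 engine before starting vLLM.
# On CPU/XPU/TPU this is now the only supported engine.
export VLLM_USE_V1=1

# Then launch the server as usual, e.g.:
# vllm serve meta-llama/Llama-3.1-8B-Instruct
echo "VLLM_USE_V1=$VLLM_USE_V1"
```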
✨ New Features
- Enable fp8 support for pplx and BatchedTritonExperts kernels.
- Add custom default max tokens configuration.
- Support Llama 4 for fused_marlin_moe.
- Automatic conversion of CrossEncoding models.
- Enable V1 engine support for Hybrid SSM/Attention Models.
- CUTLASS block scaled group gemm support for SM100 (Blackwell).
- Re-add fp32 support to V1 engine via FlexAttention.
- Support image objects in llm.chat frontend.
- Support any head size for FlexAttention backend in V1.
- Add Triton Fused MoE kernel config for FP8 E=16 on B200.
🐛 Bug Fixes
- Fix 'Unable to detect current VLLM config' warning for NHD kv cache layout.
- Register reducer even if transformers_modules is not available.
- Fix structure of transcription's decoder_prompt in frontend.
- Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod.
- Support empty sequence in cuda penalty kernel.
- Fix missing per_act_token parameter in compressed_tensors_moe.
- Fix ImportError when building on Hopper systems.
- Fix MoE OOM issue on TPU.
- Fix speculative token IDs in model runner.
- Add use_cross_encoder flag to fix activation in ClassifierPooler.
🔧 Affected Symbols
- pplx
- BatchedTritonExperts
- fused_marlin_moe
- ClassifierPooler
- CompressedTensorsW8A8Fp8MoECutlassMethod
- FlexAttention
- llm.chat
- compressed_tensors_moe
⚡ Deprecations
- V0 CPU/XPU/TPU backends have been officially removed in favor of V1.