Changelog

v0.9.2rc2

📦 vllm

⚠️ 1 breaking · ✨ 10 features · 🐛 10 fixes · ⚡ 1 deprecation · 🔧 8 symbols

Summary

This release focuses on the transition to the V1 engine by removing legacy V0 backends for CPU/TPU/XPU, while adding support for Blackwell (SM100) and Llama 4. Key improvements include FP8 kernel optimizations, FlexAttention enhancements, and expanded multimodal support in the frontend.

⚠️ Breaking Changes

  • Removed V0 backends for CPU, XPU, and TPU. Users must transition to the V1 engine for these hardware platforms.

Migration Steps

  1. If you run on CPU, XPU, or TPU, switch to the V1 engine; the V0 backends for these platforms have been removed.
  2. Update CI/build pipelines to include the new kvcache-connector dependency.
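As a sketch of step 1: during the V0→V1 transition, vLLM has selected the engine via the `VLLM_USE_V1` environment variable. A minimal way to opt in explicitly before constructing the engine (the variable must be set before vLLM is imported):

```python
import os

# Opt in to the V1 engine explicitly. With the V0 CPU/XPU/TPU backends
# removed in this release, V1 is the only code path on those platforms,
# but setting the flag makes the intent explicit and fails fast on
# older installs that would otherwise silently fall back to V0.
os.environ["VLLM_USE_V1"] = "1"

# vLLM must be imported *after* the variable is set:
# from vllm import LLM
# llm = LLM(model="<your-model>")
```

This is a sketch, not the only mechanism; consult your installed version's docs for the authoritative engine-selection behavior.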

✨ New Features

  • Enable FP8 support for the pplx and BatchedTritonExperts kernels.
  • Add custom default max tokens configuration.
  • Support Llama 4 for fused_marlin_moe.
  • Automatic conversion of CrossEncoding models.
  • Enable V1 engine support for Hybrid SSM/Attention Models.
  • CUTLASS block scaled group gemm support for SM100 (Blackwell).
  • Re-add FP32 support to the V1 engine via FlexAttention.
  • Support image objects in llm.chat frontend.
  • Support any head size for FlexAttention backend in V1.
  • Add Triton Fused MoE kernel config for FP8 E=16 on B200.
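To illustrate the llm.chat frontend feature above, here is a minimal sketch of an OpenAI-style conversation payload that includes image content. The model name and image URL are placeholders, and the generation call itself (which requires a multimodal model to be downloaded) is shown commented out:

```python
# Sketch: a chat conversation for vLLM's llm.chat that mixes text and
# image content. The URL and model name below are hypothetical.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/cat.png"},
            },
        ],
    },
]

# With a multimodal model loaded, generation would look roughly like:
# from vllm import LLM
# llm = LLM(model="<a-multimodal-model>")
# outputs = llm.chat(conversation)
```

The release note refers to accepting image objects (not only URLs) in this frontend; the exact accepted types are best confirmed against the installed version's llm.chat signature.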

🐛 Bug Fixes

  • Fix 'Unable to detect current VLLM config' warning for NHD kv cache layout.
  • Register reducer even if transformers_modules is not available.
  • Fix structure of transcription's decoder_prompt in frontend.
  • Fix unknown-attribute error for topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod.
  • Support empty sequence in cuda penalty kernel.
  • Fix missing per_act_token parameter in compressed_tensors_moe.
  • Fix ImportError when building on Hopper systems.
  • Fix MoE OOM issue on TPU.
  • Fix speculative token IDs in model runner.
  • Add use_cross_encoder flag to fix activation in ClassifierPooler.

🔧 Affected Symbols

`pplx` · `BatchedTritonExperts` · `fused_marlin_moe` · `ClassifierPooler` · `CompressedTensorsW8A8Fp8MoECutlassMethod` · `FlexAttention` · `llm.chat` · `compressed_tensors_moe`

⚡ Deprecations

  • V0 CPU/XPU/TPU backends have been officially removed in favor of V1.