Changelog

v0.10.0rc1

Breaking Changes
📦 vllm
⚠️ 1 breaking · ✨ 12 features · 🐛 10 fixes · ⚡ 1 deprecation · 🔧 7 symbols

Summary

This release introduces fp8 support for Triton experts, adds Llama 4 support, and migrates CPU/XPU/TPU backends exclusively to the V1 engine. It also includes significant performance optimizations for quantization kernels and initial support for the OpenAI Responses API.

⚠️ Breaking Changes

  • Removal of V0 backends for CPU, XPU, and TPU. Users must transition to the V1 engine for these hardware platforms.

Migration Steps

  1. Upgrade to the V1 engine if using CPU, XPU, or TPU backends as V0 is no longer supported.
  2. Ensure the 'openai' Python package meets the new minimum version requirement.
  3. Update custom model implementations to use the new 'use_cross_encoder' flag in ClassifierPooler if applicable.
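The engine selection in step 1 can be sketched as a shell fragment. This assumes the `VLLM_USE_V1` environment variable used during the V0/V1 transition; on CPU/XPU/TPU it is now the only supported path, so setting it explicitly is mostly a safeguard against stale configuration.

```shell
# Sketch: pin the engine selection before launching vLLM.
# VLLM_USE_V1=1 selects the V1 engine; V0 is no longer available
# for the CPU, XPU, and TPU backends.
export VLLM_USE_V1=1

# Upgrade the 'openai' package to satisfy the new minimum version
# (uncomment to run; the exact minimum is listed in the release's
# requirements files, which this sketch does not assume):
# pip install --upgrade openai

echo "VLLM_USE_V1=$VLLM_USE_V1"
```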

✨ New Features

  • Enable fp8 support for pplx and BatchedTritonExperts.
  • Add custom default max tokens configuration.
  • Support Llama 4 for fused_marlin_moe.
  • Automatic conversion of CrossEncoding models.
  • Enable V1 engine for Hybrid SSM/Attention Models.
  • CUTLASS block scaled group gemm support for SM100.
  • Re-add fp32 support to V1 engine via FlexAttention.
  • Support image objects in llm.chat frontend.
  • Implement initial OpenAI Responses API support.
  • Support for microbatch tokenization.
  • Add xccl support for Intel XPU and abstract platform interface for distributed backends.
  • Support any head size for FlexAttention backend in V1.
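To illustrate the initial OpenAI Responses API support above, here is a minimal sketch of the request body a client would send to a vLLM server's `/v1/responses` endpoint. The field names (`model`, `input`, `max_output_tokens`) follow the OpenAI Responses API; the model name is a placeholder, and the extent of vLLM's initial coverage of the API is not assumed here.

```python
import json

# Minimal Responses API request body (sketch). The model name below is a
# placeholder; substitute whatever model the server was launched with.
payload = {
    "model": "meta-llama/Llama-4-placeholder",
    "input": "Summarize the v0.10.0rc1 release in one sentence.",
    "max_output_tokens": 64,
}

# Serialize as it would appear on the wire, then decode to verify round-trip.
body = json.dumps(payload)
decoded = json.loads(body)
print(decoded["model"])
```

Sending this body with any HTTP client to the server's `/v1/responses` route (or via the official `openai` client once it meets the new minimum version) exercises the new endpoint.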

🐛 Bug Fixes

  • Fix 'Unable to detect current VLLM config' warning for NHD kv cache layout.
  • Register reducer even if transformers_modules is unavailable.
  • Fix transcription decoder_prompt structure in frontend.
  • Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod.
  • Support empty sequences in cuda penalty kernel.
  • Fix missing per_act_token parameter in compressed_tensors_moe.
  • Fix ImportError when building on Hopper systems.
  • Fix MoE OOM issues on TPU.
  • Fix spec token IDs in model runner for speculative decoding.
  • Fix IndexError for cached requests when pipeline parallelism is disabled.

🔧 Affected Symbols

  • ClassifierPooler
  • CompressedTensorsW8A8Fp8MoECutlassMethod
  • FlexAttention
  • llm.chat
  • fused_marlin_moe
  • BatchedTritonExperts
  • DPEngineCoreActors

⚡ Deprecations

  • V0 CPU/XPU/TPU backends have been removed in favor of V1.