v0.10.0rc1
📦 vllm
⚠ 1 breaking · ✨ 12 features · 🐛 10 fixes · ⚡ 1 deprecation · 🔧 7 symbols
Summary
This release enables fp8 support for pplx and BatchedTritonExperts, adds Llama 4 support to fused_marlin_moe, and migrates the CPU, XPU, and TPU backends exclusively to the V1 engine. It also includes significant performance optimizations for quantization kernels and initial support for the OpenAI Responses API.
⚠️ Breaking Changes
- Removal of V0 backends for CPU, XPU, and TPU. Users must transition to the V1 engine for these hardware platforms.
Migration Steps
- Migrate to the V1 engine if you use the CPU, XPU, or TPU backends, as V0 is no longer supported (see the sketch after this list).
- Ensure the 'openai' Python package meets the new minimum version requirement.
- Update custom model implementations to use the new 'use_cross_encoder' flag on ClassifierPooler, if applicable.
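A minimal migration sketch, assuming the `VLLM_USE_V1` environment variable is still honored by your build (with the V0 paths removed, V1 is the only engine on these platforms regardless); the model name is a placeholder:

```python
# Minimal V1 migration sketch for a CPU/XPU/TPU deployment. Setting
# VLLM_USE_V1 is belt-and-braces: with V0 removed, V1 is the only engine.
import os

os.environ["VLLM_USE_V1"] = "1"  # explicit opt-in; assumed still honored

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model; runs on the V1 engine
params = SamplingParams(max_tokens=32)
outputs = llm.generate(["Hello from the V1 engine:"], params)
print(outputs[0].outputs[0].text)
```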
✨ New Features
- Enable fp8 support for pplx and BatchedTritonExperts.
- Add custom default max tokens configuration.
- Support Llama 4 for fused_marlin_moe.
- Automatically convert CrossEncoding models.
- Enable V1 engine for hybrid SSM/attention models.
- Add CUTLASS block-scaled grouped GEMM support for SM100.
- Re-add fp32 support to V1 engine via FlexAttention.
- Support image objects in llm.chat frontend (see the first sketch after this list).
- Implement initial OpenAI Responses API support (see the second sketch after this list).
- Support microbatch tokenization.
- Add xccl support for Intel XPU and abstract platform interface for distributed backends.
- Support any head size for FlexAttention backend in V1.
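A hedged sketch of the llm.chat image-object feature: the `image_pil` content-part key is an assumption (check your vLLM version's chat schema), and the model name and file path are placeholders:

```python
# Sketch: passing an in-memory image object straight to llm.chat.
from PIL import Image
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")  # placeholder multimodal model

image = Image.open("photo.jpg")  # hypothetical local file
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_pil", "image_pil": image},  # assumed key for image objects
    ],
}]

for output in llm.chat(messages):
    print(output.outputs[0].text)
```

And a hedged sketch of exercising the initial Responses API support through the official `openai` Python client (1.66+ ships `client.responses`); the server URL and model name are placeholders for your own deployment:

```python
# Sketch: calling a vLLM server's Responses endpoint via the openai client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    input="Summarize the v0.10.0rc1 release in one sentence.",
)
print(resp.output_text)
```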
🐛 Bug Fixes
- Fix 'Unable to detect current VLLM config' warning for the NHD KV cache layout.
- Register reducer even if transformers_modules is unavailable.
- Fix transcription decoder_prompt structure in frontend.
- Fix unknown-attribute error for topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod.
- Support empty sequences in cuda penalty kernel.
- Fix missing per_act_token parameter in compressed_tensors_moe.
- Fix ImportError when building on Hopper systems.
- Fix MoE OOM issues on TPU.
- Fix spec token IDs in model runner for speculative decoding.
- Fix IndexError for cached requests when pipeline parallelism is disabled.
🔧 Affected Symbols
- ClassifierPooler
- CompressedTensorsW8A8Fp8MoECutlassMethod
- FlexAttention
- llm.chat
- fused_marlin_moe
- BatchedTritonExperts
- DPEngineCoreActors
⚡ Deprecations
- V0 CPU/XPU/TPU backends have been removed in favor of V1.