
v0.10.0

Breaking Changes
📦 vllm
6 breaking changes · 9 new features · 6 bug fixes · 1 deprecation · 10 affected symbols

Summary

v0.10.0 makes the V1 engine the primary focus, removing several legacy V0 backends and features while adding support for Llama 4 and NVIDIA Blackwell optimizations. It also delivers significant performance improvements via async scheduling and microbatch tokenization.

⚠️ Breaking Changes

  • Removed the V0 CPU/XPU/TPU/HPU backends; users on these platforms must migrate to the V1 engine or another supported backend.
  • Removed long context LoRA support.
  • Removed Prompt Adapters.
  • Removed Phi3-Small & BlockSparse Attention support.
  • Removed Spec Decode workers.
  • The default model is now Qwen3-0.6B.

Migration Steps

  1. Update PyTorch to 2.7.1 for CUDA environments.
  2. Update FlashInfer to v0.2.8rc1.
  3. If using CPU/XPU/TPU/HPU, ensure compatibility with the V1 engine as V0 backends are removed.
  4. Review model configurations if using Phi3-Small or Prompt Adapters as they are no longer supported.
  5. Update CLI scripts that rely on the implicit default model, which is now Qwen3-0.6B.
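The version minimums in steps 1 and 2 can be checked programmatically. A minimal, stdlib-only sketch (the helper and the PyPI package names, e.g. `flashinfer-python`, are assumptions here, not part of vLLM; the parsing deliberately ignores pre-release suffixes such as `rc1`):

```python
# Illustrative helper, not part of vLLM: verify installed packages meet
# the minimum versions named in the migration steps above.
from importlib import metadata

# Assumed PyPI distribution names; adjust to match your environment.
MINIMUMS = {"torch": (2, 7, 1), "flashinfer-python": (0, 2, 8)}

def numeric_prefix(version: str) -> tuple:
    """Parse the leading numeric components of a version string,
    stopping at the first non-numeric suffix (so '0.2.8rc1' -> (0, 2, 8))."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def check(package: str, minimum: tuple) -> bool:
    """Return True if `package` is installed at or above `minimum`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return numeric_prefix(installed) >= minimum
```

Note that tuple comparison handles the common cases (2.7.10 sorts above 2.7.1), but a real deployment check should use a proper version library such as `packaging`.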

✨ New Features

  • New model support for Llama 4 (EAGLE), EXAONE 4.0, Phi-4-mini, Hunyuan V1, and more.
  • Experimental async scheduling via --async-scheduling flag.
  • NVIDIA Blackwell (SM100) optimizations including DeepGEMM and CUTLASS block scaled group GEMM.
  • OpenAI Responses API implementation.
  • Multi-task support allowing models to handle multiple tasks and poolers.
  • Elastic expert parallel for dynamic GPU scaling.
  • ARM CPU int8 quantization and PPC64LE/ARM V1 support.
  • MXFP4 quantization support for MoE models.
  • Tensorizer S3 integration for model loading.
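As an illustration of the new Responses API support, here is a minimal client-side sketch using only the standard library. The `/v1/responses` path and the `model`/`input` payload fields follow OpenAI's Responses API convention; the base URL and helper names are assumptions for this sketch, not vLLM documentation:

```python
import json
from urllib import request

def build_responses_payload(model: str, input_text: str) -> bytes:
    """Build a request body for an OpenAI-style Responses API call."""
    return json.dumps({"model": model, "input": input_text}).encode()

def send_responses_request(base_url: str, model: str, input_text: str) -> dict:
    """POST to the Responses endpoint of a running vLLM server
    (e.g. base_url='http://localhost:8000'). Requires a live server."""
    req = request.Request(
        f"{base_url}/v1/responses",
        data=build_responses_payload(model, input_text),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```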

🐛 Bug Fixes

  • Allow use_cudagraph to work with a dynamically set VLLM_USE_V1.
  • Fix a Docker build error in the cpu-dev image.
  • Fix test_max_model_len in the entrypoints tests.
  • Fix misleading ROCm warning messages.
  • Fix example code compatibility with the latest lmcache.
  • Fix handling of non-string values in JSON keys passed from the CLI.
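The last fix concerns JSON objects passed as CLI argument values. A minimal sketch of the kind of parsing involved (illustrative only; the function name is hypothetical and this is not vLLM's actual implementation):

```python
import json

def parse_cli_json(raw: str) -> dict:
    """Parse a JSON object passed as a CLI argument value.

    Values may be any JSON type (numbers, booleans, null, nested
    objects), not just strings; keys are normalized to str.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    return {str(k): v for k, v in data.items()}
```

For example, `parse_cli_json('{"tensor_parallel_size": 2}')` keeps the value as an integer rather than coercing it to the string "2".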

🔧 Affected Symbols

  • V0 engine
  • PromptAdapters
  • Phi3-Small
  • BlockSparse Attention
  • Spec Decode workers
  • LlamaForSequenceClassification
  • AutoWeightsLoader
  • get_tokenizer_info
  • FusedMoEModularKernel
  • MultiModalHasher.hash_prompt_mm_data

⚡ Deprecations

  • Cleanup of the V0 engine codebase has begun; several V0 backends and features have already been removed, and the remainder is slated for removal.