v0.10.0
📦 vllm
⚠ 6 breaking · ✨ 9 features · 🐛 6 fixes · ⚡ 1 deprecation · 🔧 10 symbols
Summary
v0.10.0 makes the V1 engine the primary focus, removing several legacy V0 backends and features while adding support for Llama 4 and NVIDIA Blackwell optimizations. It also brings notable performance improvements via async scheduling and microbatch tokenization.
⚠️ Breaking Changes
- Removed V0 CPU/XPU/TPU/HPU backends. Users must migrate to the V1 engine or another supported backend.
- Removed long context LoRA support.
- Removed Prompt Adapters.
- Removed Phi3-Small & BlockSparse Attention support.
- Removed Spec Decode workers.
- Default model changed to Qwen3-0.6B.
Migration Steps
- Update PyTorch to 2.7.1 for CUDA environments.
- Update FlashInfer to v0.2.8rc1.
- If using CPU/XPU/TPU/HPU, ensure compatibility with the V1 engine as V0 backends are removed.
- Review model configurations if using Phi3-Small or Prompt Adapters as they are no longer supported.
- Update CLI scripts if relying on the previous default model (now Qwen3-0.6B).
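Before upgrading, it can help to confirm the environment already meets the minimums listed above (PyTorch 2.7.1, FlashInfer v0.2.8rc1). A minimal, stdlib-only sketch of such a check (the helper names are illustrative, not part of vLLM):

```python
def parse_version(v: str) -> tuple:
    """Split a dotted version like '2.7.1' into comparable integer parts.

    Trailing pre-release tags (e.g. the 'rc1' in '0.2.8rc1') are ignored,
    which is good enough for a coarse minimum-version check.
    """
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break  # stop at the first non-digit ('8rc1' -> 8)
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_minimum(installed: str, required: str) -> bool:
    """True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


# Versions taken from the migration steps above.
print(meets_minimum("2.7.1", "2.7.1"))      # PyTorch: True
print(meets_minimum("2.6.0", "2.7.1"))      # PyTorch too old: False
print(meets_minimum("0.2.8rc1", "0.2.8"))   # FlashInfer: True
```

In practice you would feed `torch.__version__` and your installed FlashInfer version into `meets_minimum` instead of the literals shown.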
✨ New Features
- New model support for Llama 4 (EAGLE), EXAONE 4.0, Phi-4-mini, Hunyuan V1, and more.
- Experimental async scheduling via --async-scheduling flag.
- NVIDIA Blackwell (SM100) optimizations including DeepGEMM and CUTLASS block scaled group GEMM.
- OpenAI Responses API implementation.
- Multi-task support allowing models to handle multiple tasks and poolers.
- Elastic expert parallel for dynamic GPU scaling.
- ARM CPU int8 quantization and PPC64LE/ARM V1 support.
- MXFP4 quantization support for MoE models.
- Tensorizer S3 integration for model loading.
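The experimental async scheduler is enabled with a server flag; a hedged sketch of the invocation (the flag name comes from the notes above, and the model is simply the new default, so substitute your own):

```shell
# Sketch: serve the new default model with experimental async scheduling.
# --async-scheduling is experimental in v0.10.0; other flags unchanged.
vllm serve Qwen/Qwen3-0.6B --async-scheduling
```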
🐛 Bug Fixes
- Allow use_cudagraph to work with dynamic VLLM_USE_V1.
- Fix docker build cpu-dev image error.
- Fix test_max_model_len in entrypoints.
- Fix misleading ROCm warning messages.
- Fix example code compatibility with latest lmcache.
- Resolved non-string value handling in JSON keys from CLI.
🔧 Affected Symbols
- V0 engine
- PromptAdapters
- Phi3-Small
- BlockSparse Attention
- Spec Decode workers
- LlamaForSequenceClassification
- AutoWeightsLoader
- get_tokenizer_info
- FusedMoEModularKernel
- MultiModalHasher.hash_prompt_mm_data
⚡ Deprecations
- V0 engine codebase cleanup initiated with several backends and features removed.