
v0.10.0

Breaking Changes
📦 vllm
6 breaking changes · 9 new features · 6 bug fixes · 1 deprecation · 10 affected symbols

Summary

v0.10.0 makes the V1 engine the primary focus, removing several legacy V0 backends and features while adding support for Llama 4 and NVIDIA Blackwell optimizations. It also delivers significant performance improvements via async scheduling and microbatch tokenization.

⚠️ Breaking Changes

  • Removed the V0 CPU/XPU/TPU/HPU backends; users on these platforms must migrate to the V1 engine or another supported backend.
  • Removed long context LoRA support.
  • Removed Prompt Adapters.
  • Removed Phi3-Small & BlockSparse Attention support.
  • Removed Spec Decode workers.
  • The default model is now Qwen3-0.6B.

Migration Steps

  1. Update PyTorch to 2.7.1 for CUDA environments.
  2. Update FlashInfer to v0.2.8rc1.
  3. If using CPU/XPU/TPU/HPU, ensure compatibility with the V1 engine as V0 backends are removed.
  4. Review model configurations if using Phi3-Small or Prompt Adapters as they are no longer supported.
  5. Update CLI scripts that rely on the implicit default model, which is now Qwen3-0.6B.
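The version minimums in steps 1 and 2 can be checked programmatically. A minimal, stdlib-only sketch (the helper and the PyPI package names, e.g. `flashinfer-python`, are assumptions here, not part of vLLM; the parsing deliberately ignores pre-release suffixes such as `rc1`):

```python
# Illustrative helper, not part of vLLM: verify installed packages meet
# the minimum versions named in the migration steps above.
from importlib import metadata

# Assumed PyPI distribution names; adjust to match your environment.
MINIMUMS = {"torch": (2, 7, 1), "flashinfer-python": (0, 2, 8)}

def numeric_prefix(version: str) -> tuple:
    """Parse the leading numeric components of a version string,
    stopping at the first non-numeric suffix (so '0.2.8rc1' -> (0, 2, 8))."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def check(package: str, minimum: tuple) -> bool:
    """Return True if `package` is installed at or above `minimum`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return numeric_prefix(installed) >= minimum
```

Note that tuple comparison handles the common cases (2.7.10 sorts above 2.7.1), but a real deployment check should use a proper version library such as `packaging`.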

✨ New Features

  • New model support for Llama 4 (EAGLE), EXAONE 4.0, Phi-4-mini, Hunyuan V1, and more.
  • Experimental async scheduling via --async-scheduling flag.
  • NVIDIA Blackwell (SM100) optimizations including DeepGEMM and CUTLASS block scaled group GEMM.
  • OpenAI Responses API implementation.
  • Multi-task support allowing models to handle multiple tasks and poolers.
  • Elastic expert parallel for dynamic GPU scaling.
  • ARM CPU int8 quantization and PPC64LE/ARM V1 support.
  • MXFP4 quantization support for MoE models.
  • Tensorizer S3 integration for model loading.
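As an illustration of the new Responses API support, here is a minimal client-side sketch using only the standard library. The `/v1/responses` path and the `model`/`input` payload fields follow OpenAI's Responses API convention; the base URL and helper names are assumptions for this sketch, not vLLM documentation:

```python
import json
from urllib import request

def build_responses_payload(model: str, input_text: str) -> bytes:
    """Build a request body for an OpenAI-style Responses API call."""
    return json.dumps({"model": model, "input": input_text}).encode()

def send_responses_request(base_url: str, model: str, input_text: str) -> dict:
    """POST to the Responses endpoint of a running vLLM server
    (e.g. base_url='http://localhost:8000'). Requires a live server."""
    req = request.Request(
        f"{base_url}/v1/responses",
        data=build_responses_payload(model, input_text),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```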

🐛 Bug Fixes

  • Allow use_cudagraph to work with a dynamically set VLLM_USE_V1.
  • Fix a Docker build error in the cpu-dev image.
  • Fix test_max_model_len in the entrypoints tests.
  • Fix misleading ROCm warning messages.
  • Fix example code compatibility with the latest lmcache.
  • Fix handling of non-string values in JSON keys passed from the CLI.
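The last fix concerns JSON objects passed as CLI argument values. A minimal sketch of the kind of parsing involved (illustrative only; the function name is hypothetical and this is not vLLM's actual implementation):

```python
import json

def parse_cli_json(raw: str) -> dict:
    """Parse a JSON object passed as a CLI argument value.

    Values may be any JSON type (numbers, booleans, null, nested
    objects), not just strings; keys are normalized to str.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    return {str(k): v for k, v in data.items()}
```

For example, `parse_cli_json('{"tensor_parallel_size": 2}')` keeps the value as an integer rather than coercing it to the string "2".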

🔧 Affected Symbols

  • V0 engine
  • PromptAdapters
  • Phi3-Small
  • BlockSparse Attention
  • Spec Decode workers
  • LlamaForSequenceClassification
  • AutoWeightsLoader
  • get_tokenizer_info
  • FusedMoEModularKernel
  • MultiModalHasher.hash_prompt_mm_data

⚡ Deprecations

  • Cleanup of the V0 engine codebase has begun; several V0 backends and features have already been removed, and the remainder is slated for removal.