Change8

v0.10.1

Breaking Changes
📦 vllm
⚠️ 2 breaking · ✨ 10 features · 🐛 5 fixes · ⚡ 1 deprecation · 🔧 10 symbols

Summary

v0.10.1 introduces support for Blackwell and RTX 5090 GPUs, expands vision-language model compatibility, and adds a plugin system for model loaders. It also deprecates V0 FA3 support and removes AQLM quantization.

⚠️ Breaking Changes

  • Removed AQLM quantization support. Users must migrate to alternative quantization methods.
  • V0 FA3 support is deprecated, which may affect FP8 KV-cache behavior in the V0 engine.

Migration Steps

  1. Migrate away from AQLM quantization to other supported formats such as BitsAndBytes or ModelOpt (see the sketch after this list).
  2. If using FlashInfer, install it as an optional dependency using 'pip install vllm[flashinfer]'.
  3. Update model names for Ernie 4.5 Base 0.3B if applicable.
  4. Review FP8 KV-cache configurations if you still use the V0 engine, since FA3 support there is deprecated.
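
As an illustration of step 1, here is a minimal sketch of switching a deployment from the removed AQLM path to BitsAndBytes in-flight quantization; the checkpoint name and prompt are placeholders, and BitsAndBytes is just one of the supported alternatives.

```python
# Minimal sketch: replacing an AQLM-quantized deployment with
# BitsAndBytes in-flight quantization. The checkpoint name is a
# placeholder; substitute your own model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    quantization="bitsandbytes",               # replaces the removed "aqlm" option
)

outputs = llm.generate(
    ["Summarize the v0.10.1 release in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```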

✨ New Features

  • New model family support: GPT-OSS, Command-A-Vision, mBART, and SmolLM3.
  • Vision-language model support: Eagle, Step3, Gemma3n, MiniCPM-V 4.0, HyperCLOVAX-SEED-Vision-Instruct-3B, Emu3, Intern-S1, and Prithvi.
  • NVIDIA Blackwell (SM100) and RTX 5090/PRO 6000 (SM120) hardware optimizations.
  • Encoder-only model support for BERT-style architectures without KV-cache.
  • Model loader plugin system for extensibility.
  • Dedicated LLM.reward interface for reward models (see the example after this list).
  • N-gram speculative decoding with single KMP token proposal algorithm.
  • Support for Unix domain sockets in OpenAI-compatible API.
  • Tree attention backend for v1 engine (experimental).
  • MXFP4 and NVFP4 quantization support for Marlin and FlashInfer backends.
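
A hedged sketch of the new dedicated reward interface mentioned above, assuming LLM.reward accepts a list of prompts like vLLM's other pooling entry points (embed, classify, score) and returns pooling-style outputs; the reward-model checkpoint is a placeholder.

```python
# Hedged sketch of the dedicated reward interface. Assumes LLM.reward
# takes a list of prompts, like the other pooling entry points, and
# returns pooling-style outputs; the checkpoint is a placeholder.
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", trust_remote_code=True)

outputs = llm.reward(["The assistant's answer was helpful and accurate."])
for out in outputs:
    # .outputs.data is assumed to hold the reward scores, mirroring
    # vLLM's pooling output objects.
    print(out.outputs.data)
```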

🐛 Bug Fixes

  • Fixed ARM CPU builds for systems without BF16 support.
  • Corrected tool_choice='required' behavior to align with the OpenAI spec when the tools list is empty (see the sketch after this list).
  • Fixed AsyncLLM response handling for aborted requests.
  • Resolved device-to-device copy overhead in Mamba2.
  • Fixed non-contiguous tensor support in FP8 quantization.
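
To illustrate the tool_choice fix above, here is a sketch against a locally running vLLM OpenAI-compatible server (the URL and served model name are placeholders). It assumes alignment with the OpenAI spec means rejecting tool_choice='required' when no tools are supplied, which is how the OpenAI API itself responds to that combination.

```python
# Sketch of the corrected tool_choice behavior. The base_url and model
# name are placeholders for a local vLLM OpenAI-compatible server.
from openai import OpenAI, BadRequestError

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

try:
    client.chat.completions.create(
        model="my-model",  # placeholder served model name
        messages=[{"role": "user", "content": "What's the weather?"}],
        tools=[],                 # empty tools list
        tool_choice="required",   # assumed to be rejected, matching OpenAI
    )
except BadRequestError as exc:
    print("Server rejected the request as expected:", exc)
```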

🔧 Affected Symbols

  • LLM.reward
  • PoolingParams
  • FusedMoe
  • Step3VisionEncoder
  • AsyncLLM
  • HermesToolParser
  • NixlConnector
  • FlashInfer
  • Mamba2
  • AQLM

⚡ Deprecations

  • V0 FlashAttention 3 (FA3) support is deprecated.