v0.10.1
📦 vllm
⚠ 2 breaking · ✨ 10 features · 🐛 5 fixes · ⚡ 1 deprecation · 🔧 10 symbols
Summary
v0.10.1 introduces support for Blackwell and RTX 5090 GPUs, expands vision-language model compatibility, and adds a plugin system for model loaders. It also deprecates V0 FA3 support and removes AQLM quantization.
⚠️ Breaking Changes
- Removed AQLM quantization support. Users must migrate to alternative quantization methods.
- V0 FlashAttention 3 (FA3) support is deprecated, which may cause issues with the FP8 KV cache in the V0 engine.
Migration Steps
- Migrate away from AQLM quantization to another supported format such as BitsAndBytes or ModelOpt (see the sketch after these steps).
- If using FlashInfer, install it as an optional dependency using 'pip install vllm[flashinfer]'.
- Update model names for Ernie 4.5 Base 0.3B if applicable.
- Review FP8 KV cache configurations if still using the V0 engine due to the FA3 deprecation.
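A minimal sketch of moving an offline-inference setup from AQLM to BitsAndBytes in-flight quantization, assuming a standard LLM constructor call; the model name is a placeholder and your deployment may need additional arguments.

```python
from vllm import LLM, SamplingParams

# Previously: LLM(model="...", quantization="aqlm")  # AQLM is no longer supported in v0.10.1
# Switch to BitsAndBytes in-flight quantization (model name is illustrative only):
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="bitsandbytes",
)

outputs = llm.generate(
    ["Summarize the v0.10.1 release in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```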
✨ New Features
- New model family support: GPT-OSS, Command-A-Vision, mBART, and SmolLM3.
- Vision-language model support: Eagle, Step3, Gemma3n, MiniCPM-V 4.0, HyperCLOVAX-SEED-Vision-Instruct-3B, Emu3, Intern-S1, and Prithvi.
- NVIDIA Blackwell (SM100) and RTX 5090/PRO 6000 (SM120) hardware optimizations.
- Encoder-only model support for BERT-style architectures without a KV cache.
- Model loader plugin system for extensibility.
- Dedicated LLM.reward interface for reward models (see the first sketch after this list).
- N-gram speculative decoding with a single KMP-based token-proposal algorithm.
- Support for Unix domain sockets in the OpenAI-compatible API (see the second sketch after this list).
- Tree attention backend for the V1 engine (experimental).
- MXFP4 and NVFP4 quantization support for Marlin and FlashInfer backends.
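A hedged sketch of the new dedicated reward interface. The task name, method signature, and output structure shown here are assumptions modeled on vLLM's existing pooling-model APIs and may differ from the released LLM.reward signature; the model name is illustrative only.

```python
from vllm import LLM

# Assumed usage of the dedicated reward interface (task name is an assumption).
llm = LLM(model="internlm/internlm2-1_8b-reward", task="reward")

# LLM.reward is expected to score inputs and return pooling-style outputs;
# the exact return structure may differ in the released API.
outputs = llm.reward(["<prompt text><candidate response>"])
for out in outputs:
    print(out.outputs.data)  # reward score for this input (assumed field)
```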
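For the Unix-domain-socket support in the OpenAI-compatible server, a client-side sketch that routes the standard OpenAI client over a socket via httpx. The socket path and served model name are placeholders, and the server-side flag for binding to a socket is not shown here; check vllm serve --help for the actual option.

```python
import httpx
from openai import OpenAI

# Assumes the server is bound to a Unix domain socket; path is illustrative.
transport = httpx.HTTPTransport(uds="/tmp/vllm.sock")

client = OpenAI(
    base_url="http://localhost/v1",  # host is only used for the Host header when routing over the socket
    api_key="EMPTY",
    http_client=httpx.Client(transport=transport),
)

resp = client.chat.completions.create(
    model="served-model-name",  # placeholder
    messages=[{"role": "user", "content": "Hello over a Unix socket"}],
)
print(resp.choices[0].message.content)
```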
🐛 Bug Fixes
- Fixed ARM CPU builds for systems without BF16 support.
- Corrected tool_choice='required' behavior to align with OpenAI spec when tools list is empty.
- Fixed AsyncLLM response handling for aborted requests.
- Resolved device-to-device copy overhead in Mamba2.
- Fixed non-contiguous tensor support in FP8 quantization.
🔧 Affected Symbols
- LLM.reward
- PoolingParams
- FusedMoe
- Step3VisionEncoder
- AsyncLLM
- HermesToolParser
- NixlConnector
- FlashInfer
- Mamba2
- AQLM
⚡ Deprecations
- V0 FlashAttention 3 (FA3) support is deprecated.