v0.8.5
📦 vllm
⚠ 1 breaking · ✨ 10 features · 🐛 8 fixes · 🔧 10 symbols
Summary
This release introduces Day 0 support for Qwen3, structural tag tool calling via xgrammar, and disaggregated serving via the KV Connector API. It also includes significant performance optimizations for MoE kernels and one breaking change: the CLI flags --enable-chunked-prefill, --multi-step-stream-outputs, and --disable-chunked-mm-input can no longer be set to 'False' and must be disabled with the '--no-' prefix instead.
⚠️ Breaking Changes
- The CLI arguments --enable-chunked-prefill, --multi-step-stream-outputs, and --disable-chunked-mm-input can no longer be explicitly set to 'False'. To disable these features, use the '--no-' prefix (e.g., --no-enable-chunked-prefill).
Migration Steps
- Update CLI scripts to replace '--enable-chunked-prefill False' with '--no-enable-chunked-prefill'.
- Update CLI scripts to replace '--multi-step-stream-outputs False' with '--no-multi-step-stream-outputs'.
- Update CLI scripts to replace '--disable-chunked-mm-input False' with '--no-disable-chunked-mm-input'.
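For launch scripts that build the vLLM command line programmatically, the three replacements above can be applied mechanically. The following is a minimal sketch, not part of vLLM: the helper name migrate_args and the model name in the usage line are hypothetical, and it only handles the space-separated '--flag False' form described in the migration steps.

```python
# Flags that, per the release notes, can no longer be set to 'False' explicitly.
BOOLEAN_FLAGS = {
    "--enable-chunked-prefill",
    "--multi-step-stream-outputs",
    "--disable-chunked-mm-input",
}


def migrate_args(argv):
    """Return a copy of argv with '--flag False' rewritten to '--no-flag'.

    Hypothetical helper for illustration; every other argument is passed
    through unchanged.
    """
    out = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        nxt = argv[i + 1] if i + 1 < len(argv) else None
        if arg in BOOLEAN_FLAGS and nxt is not None and nxt.lower() == "false":
            # Drop the leading '--' and the trailing 'False' value,
            # then emit the negated form of the flag.
            out.append("--no-" + arg[len("--"):])
            i += 2
        else:
            out.append(arg)
            i += 1
    return out


old = ["vllm", "serve", "my-model", "--enable-chunked-prefill", "False"]
print(" ".join(migrate_args(old)))
# prints: vllm serve my-model --no-enable-chunked-prefill
```

Flags that are not followed by 'False' are left as they are, so the helper is safe to run on command lines that have already been migrated.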
✨ New Features
- Day 0 support for Qwen3 and Qwen3MoE models.
- Added support for ModernBERT, Granite Speech, PLaMo2, Kimi-VL, and Qwen2.5-Omni (thinker only).
- Support for the Snowflake Arctic Embed model family.
- Structural tag support using xgrammar for tool calling in the V1 engine.
- Disaggregated serving with KV Connector API V1 and LMCache KV connector.
- Dynamic LoRA loading from remote servers.
- New 'vllm bench [latency, throughput]' CLI commands.
- Support for the Microsoft BitBLAS runtime kernel library for low-precision computation.
- Added '/server_info' endpoint to retrieve vllm_config.
- EAGLE-3 speculative decoding support.
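As an illustration of the '/server_info' endpoint listed above, the sketch below fetches it with the Python standard library. Several details are assumptions rather than facts from these notes: that the server listens on http://localhost:8000, that the endpoint answers a plain GET request, and that it is enabled in your deployment. The helper names server_info_url and fetch_server_info are hypothetical.

```python
import urllib.request


def server_info_url(base_url):
    """Build the '/server_info' URL for a given server base URL."""
    return base_url.rstrip("/") + "/server_info"


def fetch_server_info(base_url="http://localhost:8000", timeout=5.0):
    """Return the raw response body of '/server_info' as text.

    Assumes a plain GET request is accepted; the default host and port
    are placeholders for your own deployment.
    """
    with urllib.request.urlopen(server_info_url(base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_server_info() against a running server would return the endpoint's response body, which according to the notes carries the vllm_config. The body is returned as raw text because its exact format is not specified here.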
🐛 Bug Fixes
- Fixed fp8 weight loading for Qwen3.
- Fixed multi-modal caches not behaving as LRU caches.
- Fixed accuracy for Llama4 Int4 and chat templates.
- Fixed offline multi-modal beam search.
- Fixed broken GritLM model and tests due to missing pooling_metadata.
- Fixed potential CUDA graph breakage for merge_attn_states kernel.
- Fixed exponential padding on TPU V1 when max-num-batched-tokens is not a power of 2.
- Security fix: Prevented the TCP ZMQ socket from binding to all network interfaces.
🔧 Affected Symbols
- SchedulerConfig
- vllm.bench
- xgrammar
- KVConnector
- LMCache
- Sampler
- FlashInfer
- BitBLAS
- api_server.server_info
- backend_xgrammar.py