v0.8.5

📅 Apr 28, 2025 · 📦 vllm

⚠ 1 breaking · ✨ 10 features · 🐛 8 fixes · 🔧 10 symbols

Summary

This release introduces Day 0 support for Qwen3, structural tag tool calling via xgrammar, and disaggregated serving via the KV Connector API. It also includes significant performance optimizations for MoE kernels and one breaking change: three boolean CLI flags (chunked prefill, multi-step stream outputs, and chunked multi-modal input) can no longer be disabled by passing 'False'.

⚠️ Breaking Changes

The CLI arguments --enable-chunked-prefill, --multi-step-stream-outputs, and --disable-chunked-mm-input can no longer be explicitly set to 'False'. To disable these features, use the '--no-' prefix (e.g., --no-enable-chunked-prefill).

Migration Steps

Update CLI scripts to replace '--enable-chunked-prefill False' with '--no-enable-chunked-prefill'.
Update CLI scripts to replace '--multi-step-stream-outputs False' with '--no-multi-step-stream-outputs'.
Update CLI scripts to replace '--disable-chunked-mm-input False' with '--no-disable-chunked-mm-input'.
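
If launch commands are generated programmatically, the three replacements above can be applied mechanically. The following is a minimal sketch, not part of vLLM: the helper name migrate_args is hypothetical, and handling of the '--flag=False' spelling is an assumption, since the notes only mention the space-separated form.

```python
# Flags the release notes list as no longer accepting an explicit 'False'.
BOOLEAN_FLAGS = (
    "--enable-chunked-prefill",
    "--multi-step-stream-outputs",
    "--disable-chunked-mm-input",
)


def migrate_args(argv):
    """Rewrite '--flag False' (and, by assumption, '--flag=False') to '--no-flag'.

    Hypothetical helper for illustration; all other arguments pass through
    unchanged.
    """
    out = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        name, sep, value = arg.partition("=")
        if name in BOOLEAN_FLAGS:
            # Form 1: '--flag=False' (assumed spelling, not named in the notes).
            if sep and value.lower() == "false":
                out.append("--no-" + name[2:])
                i += 1
                continue
            # Form 2: '--flag False' as two separate tokens.
            if not sep and i + 1 < len(argv) and argv[i + 1].lower() == "false":
                out.append("--no-" + name[2:])
                i += 2
                continue
        out.append(arg)
        i += 1
    return out


# 'my-model' is a placeholder model name.
old = ["vllm", "serve", "my-model", "--enable-chunked-prefill", "False"]
new = migrate_args(old)
```

Here `new` ends with '--no-enable-chunked-prefill' in place of the two old tokens. Arguments that enable a flag are left untouched, because the notes only describe the 'False' case.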

✨ New Features

Day 0 support for Qwen3 and Qwen3MoE models.
Added support for ModernBERT, Granite Speech, PLaMo2, Kimi-VL, and Qwen2.5-Omni (thinker only).
Support for Snowflake Arctic Embed family.
Structural tag support using xgrammar for tool calling in V1 engine.
Disaggregated serving with KV Connector API V1 and LMCache KV connector.
Dynamic LoRA loading from remote servers.
New 'vllm bench [latency, throughput]' CLI commands.
Support for Microsoft BitBLAS runtime kernel library for low precision computation.
Added '/server_info' endpoint to retrieve vllm_config.
EAGLE-3 speculative decoding support.
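
As an illustration of the new '/server_info' endpoint, the sketch below queries it with the Python standard library. Several details are assumptions rather than facts from these notes: the base URL http://localhost:8000, that the endpoint answers a plain GET, and that the body is JSON. The function names are hypothetical, and the endpoint may only be exposed under certain server settings.

```python
import json
import urllib.request


def server_info_url(base_url: str) -> str:
    """Join a server base URL with the /server_info path."""
    return base_url.rstrip("/") + "/server_info"


def fetch_server_info(base_url: str = "http://localhost:8000",
                      timeout: float = 5.0) -> dict:
    """GET /server_info and decode the body, assuming it is JSON.

    The default base URL is an assumption; adjust it to the running server.
    """
    with urllib.request.urlopen(server_info_url(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, with a server running:
#     info = fetch_server_info()
#     print(json.dumps(info, indent=2))
```

The URL construction is kept separate from the network call so that it can be checked without a live server.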

🐛 Bug Fixes

Fixed fp8 weight loading for Qwen3.
Fixed multi-modal caches not behaving as LRU caches.
Fixed accuracy for Llama4 Int4 and chat templates.
Fixed offline multi-modal beam search.
Fixed broken GritLM model and tests due to missing pooling_metadata.
Fixed potential CUDA graph breakage for merge_attn_states kernel.
Fixed exponential padding on TPU V1 when max-num-batched-tokens is not a power of 2.
Security fix: Prevented the TCP ZMQ socket from binding to all interfaces.

🔧 Affected Symbols

SchedulerConfig
vllm.bench
xgrammar
KVConnector
LMCache
Sampler
FlashInfer
BitBLAS
api_server.server_info
backend_xgrammar.py