
v0.8.0rc1

📦 vllm
⚠️ 1 breaking · ✨ 12 features · 🐛 10 fixes · ⚡ 1 deprecation · 🔧 7 symbols

Summary

This release introduces Expert Parallelism for DeepSeek models, a new /score endpoint for embedding models, and significant V1 engine enhancements, including parallel sampling. It also stops setting a global seed automatically, so users must set seeds explicitly for reproducible results.

⚠️ Breaking Changes

  • vLLM no longer sets the global seed automatically (#14274). This may lead to non-deterministic results in environments relying on the previous behavior.

Migration Steps

  1. Manually set the `seed` parameter in your configuration or API calls if you require reproducible results, as the global seed is no longer set by default (see the sketch after these steps).
  2. Update scripts that use `benchmark_serving.py` to remove the deprecated `--dataset` flag.
  3. If `VLLM_ATTENTION_BACKEND` is set to FlashInfer, install FlashInfer explicitly to avoid a startup error.
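
A minimal sketch of pinning seeds explicitly under the new default; the model name is a placeholder, not a recommendation:

```python
from vllm import LLM, SamplingParams

# Engine-level seed (placeholder model name).
llm = LLM(model="facebook/opt-125m", seed=42)

# Per-request seed makes sampling for this request reproducible.
params = SamplingParams(temperature=0.8, seed=42, max_tokens=32)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```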

✨ New Features

  • Enabled the /score endpoint for embedding models (request sketch after this list).
  • Added backend-specific options for guided decoding (sketch after this list).
  • Added support for Expert Parallelism (EP) for DeepSeek models.
  • Added support for SSL key rotation in the HTTP server.
  • Added streamK for block-quantized CUTLASS kernels.
  • Added support for nvfp4 CUTLASS GEMM on NVIDIA hardware.
  • Implemented a merged multimodal processor for Whisper models.
  • Added a script to set up Ray for multi-node vLLM deployments.
  • V1 engine now supports parallel sampling via AsyncLLM and LLMEngine (sketch after this list).
  • Added the `--show-hidden-metrics-for-version` CLI argument.
  • Added support for `allowed_token_ids` in the V1 sampler (sketch after this list).
  • MLA (Multi-Head Latent Attention) now supports chunked prefill.
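
A sketch of calling the new /score endpoint against a running server. The `text_1`/`text_2` request fields and the model name are assumptions modeled on vLLM's score API and may differ for your deployment:

```python
import requests

# Assumed local deployment and request schema; adjust host, port,
# and model name (placeholder) for your setup.
resp = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "BAAI/bge-base-en-v1.5",  # placeholder embedding model
        "text_1": "What is the capital of France?",
        "text_2": "Paris is the capital of France.",
    },
)
print(resp.json())
```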
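
A hedged sketch of passing a backend-specific option to guided decoding via `GuidedDecodingParams`; the `xgrammar:disable-any-whitespace` option string is an assumption about how such options are encoded:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Backend option appended after the backend name (assumed syntax).
guided = GuidedDecodingParams(
    json={"type": "object", "properties": {"city": {"type": "string"}}},
    backend="xgrammar:disable-any-whitespace",
)
params = SamplingParams(guided_decoding=guided, max_tokens=64)
out = llm.generate(["Return a JSON object with a city name."], params)
print(out[0].outputs[0].text)
```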
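
Parallel sampling in V1 is exercised through the existing `n` parameter; a minimal sketch (placeholder model name):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# n > 1 returns several completions per prompt from a single call.
params = SamplingParams(n=4, temperature=0.8, max_tokens=32)
result = llm.generate(["Write a haiku about the sea."], params)
for completion in result[0].outputs:
    print(completion.text)
```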
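
A sketch of constraining generation with `allowed_token_ids`; the "yes"/"no" restriction is illustrative, and the token ids come from the model's own tokenizer:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
tokenizer = llm.get_tokenizer()

# Restrict sampling to the token ids for "yes" and "no" (illustrative).
allowed = (tokenizer.encode("yes", add_special_tokens=False)
           + tokenizer.encode("no", add_special_tokens=False))
params = SamplingParams(allowed_token_ids=allowed, max_tokens=1)
out = llm.generate(["Is the sky blue? Answer yes or no:"], params)
print(out[0].outputs[0].text)
```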

🐛 Bug Fixes

  • Fixed max_num_batched_tokens for MLA.
  • Fixed OLMo 2 QKV splitting for GQA and MQA.
  • Fixed CPU all-reduce by using the native PyTorch implementation.
  • Fixed invalid port validation bounds (`ge`/`le`) in the API server.
  • Fixed benchmark script inaccuracies when `max_model_len < input_len + output_len`.
  • Fixed illegal memory access for MoE on H20 hardware.
  • Fixed engine core client shutdown hangs in V1.
  • Fixed memory issue with logits and sampling in V1.
  • Fixed current stream usage for nvfp4 quantization on NVIDIA.
  • Fixed boolean conversion for OpenVINO environment variables.

🔧 Affected Symbols

`AsyncLLM` · `LLMEngine` · `FlashPagedAttention` · `benchmark_serving.py` · `VLLM_ATTENTION_BACKEND` · `prefix_prefill` · `vllm:cache_config_info`

⚡ Deprecations

  • Deprecated the `--dataset` argument in `benchmark_serving.py`.