v0.8.0rc1
📦 vllm
⚠ 1 breaking · ✨ 12 features · 🐛 10 fixes · ⚡ 1 deprecation · 🔧 7 symbols
Summary
This release introduces Expert Parallelism for DeepSeek models, a new /score endpoint for embeddings, and significant V1 engine enhancements including parallel sampling. It also removes global seed setting, requiring users to manually define seeds for reproducibility.
⚠️ Breaking Changes
- vLLM no longer sets the global seed automatically (#14274). This may lead to non-deterministic results in environments relying on the previous behavior.
Migration Steps
- Manually set the `seed` parameter in your configuration or API calls if you require reproducible results, as the global seed is no longer set by default (see the sketch after these steps).
- Update scripts using `benchmark_serving.py` to remove the deprecated `--dataset` flag.
- If `VLLM_ATTENTION_BACKEND` is set to FlashInfer, install FlashInfer explicitly to avoid startup errors.
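A minimal sketch of pinning the seed explicitly via the offline Python API; the model name is a placeholder, and both the engine-level and per-request seeds are shown:

```python
from vllm import LLM, SamplingParams

# vLLM no longer seeds the global RNGs for you, so pass a seed
# explicitly wherever you need reproducible output.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", seed=42)  # placeholder model; engine-level seed

# A per-request seed can also be set on the sampling parameters.
params = SamplingParams(temperature=0.8, seed=42)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```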
✨ New Features
- Enabled the `/score` endpoint for embedding models (see the request sketch after this list).
- Added backend-specific options for guided decoding (see the guided-decoding sketch after this list).
- Support for Expert Parallelism (EP) for DeepSeek Models.
- Support for SSL Key Rotation in the HTTP Server.
- Added Stream-K scheduling for block-quantized CUTLASS kernels.
- Support for NVFP4 CUTLASS GEMM on NVIDIA hardware.
- Implemented merged multimodal processor for Whisper models.
- Added script to set up Ray for multi-node vLLM deployments.
- V1 engine now supports parallel sampling (AsyncLLM and LLMEngine); see the sampling sketch after this list.
- Added `--show-hidden-metrics-for-version` CLI argument.
- Support for `allowed_token_ids` in the V1 sampler (also covered in the sampling sketch below).
- MLA (Multi-Head Latent Attention) now supports chunked prefill.
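For the new `/score` endpoint, a hedged sketch of a request against a locally running server; the host, port, model name, and exact payload fields here are assumptions based on vLLM's OpenAI-compatible server conventions:

```python
import requests

# Assumed local deployment; adjust host, port, and model to your setup.
resp = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "BAAI/bge-reranker-v2-m3",  # example reranker/embedding model
        "text_1": "What is the capital of France?",
        "text_2": [
            "Paris is the capital of France.",
            "The capital of Germany is Berlin.",
        ],
    },
)
resp.raise_for_status()
for item in resp.json()["data"]:  # assumed response shape: one score per pair
    print(item["score"])
```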
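For the backend-specific guided-decoding options, a sketch using `GuidedDecodingParams`; the model name and the backend option string are assumptions, not confirmed values from this release:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

# Backend-specific options are appended after the backend name; the
# "xgrammar:disable-any-whitespace" value here is an assumed example.
guided = GuidedDecodingParams(
    choice=["positive", "negative"],
    backend="xgrammar:disable-any-whitespace",
)
params = SamplingParams(guided_decoding=guided, max_tokens=8)
out = llm.generate(["Sentiment of 'I love it':"], params)
print(out[0].outputs[0].text)
```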
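Parallel sampling and `allowed_token_ids` are both driven through `SamplingParams`; a minimal sketch, where the model name and the token IDs are placeholders:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

# n > 1 requests several completions for one prompt (parallel sampling);
# allowed_token_ids restricts decoding to the given vocabulary IDs.
params = SamplingParams(
    n=4,
    temperature=1.0,
    max_tokens=16,
    allowed_token_ids=[9906, 11, 1917, 0],  # placeholder token IDs
)
result = llm.generate(["Say hello:"], params)[0]
for completion in result.outputs:  # one entry per sampled sequence
    print(completion.text)
```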
🐛 Bug Fixes
- Fixed `max_num_batched_tokens` for MLA.
- Fixed OLMo 2 QKV splitting for GQA and MQA.
- Fixed CPU all-reduce by using a native PyTorch implementation.
- Fixed invalid port validation bounds (`ge`/`le`) in the API server.
- Fixed benchmark script inaccuracies when `max_model_len < input_len + output_len`.
- Fixed illegal memory access for MoE on H20 hardware.
- Fixed engine core client shutdown hangs in V1.
- Fixed memory issue with logits and sampling in V1.
- Fixed current stream usage for nvfp4 quantization on NVIDIA.
- Fixed boolean conversion for OpenVINO environment variables.
🔧 Affected Symbols
`AsyncLLM` · `LLMEngine` · `FlashPagedAttention` · `benchmark_serving.py` · `VLLM_ATTENTION_BACKEND` · `prefix_prefill` · `vllm:cache_config_info`
⚡ Deprecations
- Deprecated the `--dataset` argument from `benchmark_serving.py`.