
v0.8.0rc1

📦 vllm
⚠️ 1 breaking · ✨ 12 features · 🐛 10 fixes · ⚡ 1 deprecation · 🔧 7 symbols

Summary

This release introduces Expert Parallelism for DeepSeek models, a new /score endpoint for embedding models, and significant V1 engine enhancements, including parallel sampling. It also stops setting a global seed automatically, so users must set seeds explicitly for reproducible results.

⚠️ Breaking Changes

  • vLLM no longer sets the global seed automatically (#14274). This may lead to non-deterministic results in environments relying on the previous behavior.

Migration Steps

  1. Manually set the `seed` parameter in your configuration or API calls if you require reproducible results, as the global seed is no longer set by default (see the sketch after these steps).
  2. Update scripts that use `benchmark_serving.py` to remove the deprecated `--dataset` flag.
  3. If `VLLM_ATTENTION_BACKEND` is set to FlashInfer, install FlashInfer explicitly to avoid a startup error.
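
A minimal sketch of pinning seeds explicitly under the new default; the model name is a placeholder, not a recommendation:

```python
from vllm import LLM, SamplingParams

# Engine-level seed (placeholder model name).
llm = LLM(model="facebook/opt-125m", seed=42)

# Per-request seed makes sampling for this request reproducible.
params = SamplingParams(temperature=0.8, seed=42, max_tokens=32)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```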

✨ New Features

  • Enabled the /score endpoint for embedding models (request sketch after this list).
  • Added backend-specific options for guided decoding (sketch after this list).
  • Added support for Expert Parallelism (EP) for DeepSeek models.
  • Added support for SSL key rotation in the HTTP server.
  • Added streamK for block-quantized CUTLASS kernels.
  • Added support for nvfp4 CUTLASS GEMM on NVIDIA hardware.
  • Implemented a merged multimodal processor for Whisper models.
  • Added a script to set up Ray for multi-node vLLM deployments.
  • V1 engine now supports parallel sampling via AsyncLLM and LLMEngine (sketch after this list).
  • Added the `--show-hidden-metrics-for-version` CLI argument.
  • Added support for `allowed_token_ids` in the V1 sampler (sketch after this list).
  • MLA (Multi-Head Latent Attention) now supports chunked prefill.
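
A sketch of calling the new /score endpoint against a running server. The `text_1`/`text_2` request fields and the model name are assumptions modeled on vLLM's score API and may differ for your deployment:

```python
import requests

# Assumed local deployment and request schema; adjust host, port,
# and model name (placeholder) for your setup.
resp = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "BAAI/bge-base-en-v1.5",  # placeholder embedding model
        "text_1": "What is the capital of France?",
        "text_2": "Paris is the capital of France.",
    },
)
print(resp.json())
```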
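
A hedged sketch of passing a backend-specific option to guided decoding via `GuidedDecodingParams`; the `xgrammar:disable-any-whitespace` option string is an assumption about how such options are encoded:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Backend option appended after the backend name (assumed syntax).
guided = GuidedDecodingParams(
    json={"type": "object", "properties": {"city": {"type": "string"}}},
    backend="xgrammar:disable-any-whitespace",
)
params = SamplingParams(guided_decoding=guided, max_tokens=64)
out = llm.generate(["Return a JSON object with a city name."], params)
print(out[0].outputs[0].text)
```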
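
Parallel sampling in V1 is exercised through the existing `n` parameter; a minimal sketch (placeholder model name):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# n > 1 returns several completions per prompt from a single call.
params = SamplingParams(n=4, temperature=0.8, max_tokens=32)
result = llm.generate(["Write a haiku about the sea."], params)
for completion in result[0].outputs:
    print(completion.text)
```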
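
A sketch of constraining generation with `allowed_token_ids`; the "yes"/"no" restriction is illustrative, and the token ids come from the model's own tokenizer:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
tokenizer = llm.get_tokenizer()

# Restrict sampling to the token ids for "yes" and "no" (illustrative).
allowed = (tokenizer.encode("yes", add_special_tokens=False)
           + tokenizer.encode("no", add_special_tokens=False))
params = SamplingParams(allowed_token_ids=allowed, max_tokens=1)
out = llm.generate(["Is the sky blue? Answer yes or no:"], params)
print(out[0].outputs[0].text)
```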

🐛 Bug Fixes

  • Fixed max_num_batched_tokens for MLA.
  • Fixed OLMo 2 QKV splitting for GQA and MQA.
  • Fixed CPU all-reduce by using the native PyTorch implementation.
  • Fixed invalid port validation bounds (`ge`/`le`) in the API server.
  • Fixed benchmark script inaccuracies when `max_model_len < input_len + output_len`.
  • Fixed illegal memory access for MoE on H20 hardware.
  • Fixed engine core client shutdown hangs in V1.
  • Fixed memory issue with logits and sampling in V1.
  • Fixed current stream usage for nvfp4 quantization on NVIDIA.
  • Fixed boolean conversion for OpenVINO environment variables.

🔧 Affected Symbols

`AsyncLLM` · `LLMEngine` · `FlashPagedAttention` · `benchmark_serving.py` · `VLLM_ATTENTION_BACKEND` · `prefix_prefill` · `vllm:cache_config_info`

⚡ Deprecations

  • Deprecated the `--dataset` argument in `benchmark_serving.py`.