
v0.8.0

Breaking Changes
📦 vllm
3 breaking · 8 features · 4 fixes · 3 deprecations · 11 symbols

Summary

v0.8.0 enables the V1 engine by default, introduces support for NVIDIA Blackwell and Gemma 3, and significantly optimizes DeepSeek model performance through FlashMLA and Expert Parallelism.

⚠️ Breaking Changes

  • The default value of 'seed' is now None; to ensure reproducibility, you must set the seed explicitly.
  • The 'kv_cache' and 'attn_metadata' arguments have been removed from the model's forward method. Access these via 'forward_context' instead.
  • vLLM now reads defaults from the model's 'generation_config' for chat templates and sampling parameters (e.g., temperature), which may change output behavior if you do not set these explicitly.
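Because the 'seed' default changed silently, the safest migration is to pin it explicitly wherever engine arguments are built. A minimal sketch; the 'engine_args_with_seed' helper is hypothetical, not part of vLLM:

```python
# Hypothetical helper: v0.8.0 changed the default of 'seed' to None, so
# reproducible runs must pass a seed explicitly. This fills one in only
# when the caller did not already provide it.
def engine_args_with_seed(base_args: dict, seed: int = 42) -> dict:
    args = dict(base_args)
    args.setdefault("seed", seed)  # keep an explicitly chosen seed untouched
    return args

args = engine_args_with_seed({"model": "facebook/opt-125m"})
# Once vllm is installed, pass these through, e.g.:
#   from vllm import LLM
#   llm = LLM(**args)
```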

Migration Steps

  1. Update code to explicitly set 'seed' if reproducibility is required.
  2. Refactor custom model forward methods to use 'forward_context' instead of 'kv_cache' and 'attn_metadata' arguments.
  3. To use Gemma 3, install transformers from the main branch: 'pip install git+https://github.com/huggingface/transformers.git'.
  4. If the V1 engine causes issues, disable it by setting the environment variable 'VLLM_USE_V1=0'.
  5. Update monitoring systems to replace deprecated vllm metrics.
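Step 4 can be done from Python as well as the shell; the important detail is that the variable must be set before vllm is imported, since the engine choice is read at startup. A minimal sketch:

```python
import os

# Fall back to the V0 engine. This must run before 'import vllm',
# because the engine selection is read when vLLM initializes.
os.environ["VLLM_USE_V1"] = "0"

# import vllm  # safe to import after the variable is set
```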

✨ New Features

  • V1 engine enabled by default for supported use cases.
  • Support for Structured Outputs and reasoning outputs (including outlines engine support).
  • DeepSeek improvements: FlashMLA integration, Expert Parallelism (EP), and Data Parallelism (DP) support.
  • New model support: Gemma 3, Mistral Small 3.1, Phi-4-multimodal-instruct, Grok1, QwQ-32B, and Zamba2.
  • NVIDIA Blackwell support: nvfp4 cutlass gemm and ModelOpt FP4 checkpoint support.
  • API Server enhancements: /load and /is_sleeping endpoints, and SSL Key Rotation.
  • Disaggregated Serving: KV cache offloading and disagg prefill via LMCache connector.
  • Hardware support expansions for AMD (ROCm), TPU, Neuron, CPU (FP8 KV cache), and s390x.
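The new '/is_sleeping' endpoint can be polled over plain HTTP. A minimal client sketch, assuming the endpoint returns a JSON body containing an 'is_sleeping' boolean (the exact response shape is an assumption, not confirmed by the release notes):

```python
import json
import urllib.request

def parse_is_sleeping(payload: bytes) -> bool:
    # Assumed response shape: {"is_sleeping": <bool>}.
    return bool(json.loads(payload).get("is_sleeping", False))

def is_sleeping(base_url: str, timeout: float = 5.0) -> bool:
    # Polls the /is_sleeping endpoint added to the API server in v0.8.0.
    with urllib.request.urlopen(f"{base_url}/is_sleeping", timeout=timeout) as resp:
        return parse_is_sleeping(resp.read())
```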

🐛 Bug Fixes

  • Fixed illegal memory access for MoE on H20 and blockwise cutlass fp8 GEMMs.
  • Fixed FP16 overflow issues for DeepSeek V2.
  • Resolved aiohttp and jinja2 CVE vulnerabilities.
  • Fixed driver environment variable passing to Ray workers.

🔧 Affected Symbols

'VLLM_USE_V1', 'forward_context', 'generation_config', 'SupportsV0Only', 'FlashMLA', 'LMCache', 'SGMV', 'BGMV', 'vllm:time_in_queue_requests', 'vllm:model_forward_time_milliseconds', 'vllm:model_execute_time_milliseconds'

⚡ Deprecations

  • Request time metrics: 'vllm:time_in_queue_requests', 'vllm:model_forward_time_milliseconds', and 'vllm:model_execute_time_milliseconds'.
  • Legacy input mapper for out-of-tree (OOT) multimodal models.
  • SGMV and BGMV kernels have been retired in favor of newer implementations.
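Migration step 5 asks monitoring systems to stop relying on these metrics; a small sketch that flags them in a list of scraped metric names before an upgrade (the 'find_deprecated' helper is hypothetical, only the three metric names come from the release notes):

```python
# Metric names deprecated in vLLM v0.8.0, per the release notes above.
DEPRECATED_METRICS = {
    "vllm:time_in_queue_requests",
    "vllm:model_forward_time_milliseconds",
    "vllm:model_execute_time_milliseconds",
}

def find_deprecated(metric_names):
    """Return the deprecated vLLM metrics present in a scrape, sorted."""
    return sorted(m for m in metric_names if m in DEPRECATED_METRICS)

# Example: audit a dashboard's metric list before upgrading.
flagged = find_deprecated([
    "vllm:time_in_queue_requests",
    "vllm:num_requests_running",
])
```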