v0.8.0
📦 vllm · View on GitHub →
⚠ 3 breaking · ✨ 8 features · 🐛 4 fixes · ⚡ 3 deprecations · 🔧 11 symbols
Summary
v0.8.0 enables the V1 engine by default, introduces support for NVIDIA Blackwell and Gemma 3, and significantly optimizes DeepSeek model performance through FlashMLA and Expert Parallelism.
⚠️ Breaking Changes
- The default value of 'seed' is now None. To ensure reproducible outputs, you must set the seed explicitly.
- The 'kv_cache' and 'attn_metadata' arguments have been removed from the model's forward method. Access these via 'forward_context' instead.
- vLLM now applies the model's 'generation_config' by default for chat templates and sampling parameters (e.g., temperature), which may change output behavior unless those parameters are specified explicitly.
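The 'forward_context' change above can be sketched as follows. This is a minimal stand-in, not vLLM's actual implementation: 'ForwardContext' and 'get_forward_context' here are mock definitions illustrating the new calling pattern; in vLLM the context lives in 'vllm.forward_context', so verify the exact names against your installed version.

```python
from dataclasses import dataclass, field

# Mock stand-in for vLLM's forward context (illustration only).
@dataclass
class ForwardContext:
    attn_metadata: dict = field(default_factory=dict)

_CURRENT_CONTEXT = ForwardContext(attn_metadata={"num_prefill_tokens": 3})

def get_forward_context() -> ForwardContext:
    return _CURRENT_CONTEXT

class MyCustomModel:
    # Old (pre-0.8.0) signature:
    #   def forward(self, input_ids, positions, kv_caches, attn_metadata): ...
    # New (0.8.0) signature: kv caches and attention metadata are no longer
    # arguments; they are read from the forward context instead.
    def forward(self, input_ids, positions):
        attn_metadata = get_forward_context().attn_metadata
        return len(input_ids), attn_metadata["num_prefill_tokens"]

print(MyCustomModel().forward([101, 102, 103], [0, 1, 2]))  # → (3, 3)
```

The key point is that custom models should drop the 'kv_cache' and 'attn_metadata' parameters from 'forward' and fetch that state from the context object inside the method body.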
Migration Steps
- Update code to explicitly set 'seed' if reproducibility is required.
- Refactor custom model forward methods to use 'forward_context' instead of 'kv_cache' and 'attn_metadata' arguments.
- To use Gemma 3, install transformers from the main branch: 'pip install git+https://github.com/huggingface/transformers.git'.
- If the V1 engine causes issues, disable it by setting the environment variable 'VLLM_USE_V1=0'.
- Update monitoring systems to replace the deprecated 'vllm:*' metrics listed under Deprecations.
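For the V1 fallback step, note that 'VLLM_USE_V1' must be set in the environment of the process that launches vLLM (e.g., before running 'vllm serve'). A quick shell check:

```shell
# Disable the V1 engine for this shell session; the variable name comes
# from the migration steps above.
export VLLM_USE_V1=0

# Confirm the variable is visible to child processes such as 'vllm serve'.
python3 -c 'import os; print(os.environ["VLLM_USE_V1"])'  # → 0
```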
✨ New Features
- V1 engine enabled by default for supported use cases.
- Support for Structured Outputs and reasoning outputs (including outlines engine support).
- DeepSeek improvements: FlashMLA integration, Expert Parallelism (EP), and Data Parallelism (DP) support.
- New model support: Gemma 3, Mistral Small 3.1, Phi-4-multimodal-instruct, Grok1, QwQ-32B, and Zamba2.
- NVIDIA Blackwell support: nvfp4 cutlass gemm and ModelOpt FP4 checkpoint support.
- API Server enhancements: /load and /is_sleeping endpoints, and SSL Key Rotation.
- Disaggregated Serving: KV cache offloading and disagg prefill via LMCache connector.
- Hardware support expansions for AMD (ROCm), TPU, Neuron, CPU (FP8 KV cache), and s390x.
🐛 Bug Fixes
- Fixed illegal memory access for MoE on H20 and blockwise cutlass fp8 GEMMs.
- Fixed FP16 overflow issues for DeepSeek V2.
- Resolved aiohttp and jinja2 CVE vulnerabilities.
- Fixed driver environment variable passing to Ray workers.
🔧 Affected Symbols
'VLLM_USE_V1', 'forward_context', 'generation_config', 'SupportsV0Only', 'FlashMLA', 'LMCache', 'SGMV', 'BGMV', 'vllm:time_in_queue_requests', 'vllm:model_forward_time_milliseconds', 'vllm:model_execute_time_milliseconds'
⚡ Deprecations
- Request time metrics: 'vllm:time_in_queue_requests', 'vllm:model_forward_time_milliseconds', and 'vllm:model_execute_time_milliseconds'.
- Legacy input mapper for out-of-tree (OOT) multimodal models.
- SGMV and BGMV kernels have been retired in favor of newer implementations.
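The monitoring migration step can start with a simple search for the deprecated metric names in your dashboard and alerting configs. The 'dashboards/latency.promql' file below is a made-up example for illustration; point the loop at your own Grafana or alert-rule directories.

```shell
# Example dashboard query that still references a deprecated metric
# (hypothetical file, created here only to demonstrate the search).
mkdir -p dashboards
cat > dashboards/latency.promql <<'EOF'
histogram_quantile(0.95, rate(vllm:model_forward_time_milliseconds_bucket[5m]))
EOF

# List files that reference any of the three deprecated request-time metrics.
for m in vllm:time_in_queue_requests \
         vllm:model_forward_time_milliseconds \
         vllm:model_execute_time_milliseconds; do
  grep -rl "$m" dashboards/ || true
done  # → dashboards/latency.promql
```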