v0.8.2

📅 Mar 23, 2025 · 📦 vllm

⚠ 2 breaking · ✨ 11 features · 🐛 7 fixes · ⚡ 1 deprecation · 🔧 11 symbols

Summary

This release focuses on V1 engine stability, including a critical memory-usage fix and a fix for a crash triggered by duplicate request IDs. It also introduces FP8 KV cache support in the V1 engine, adds a new fastsafetensors weight loader, and expands hardware support for TPU and ROCm.
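
To give a rough sense of why an FP8 KV cache matters, the sketch below estimates per-token KV cache size at 16-bit and 8-bit precision. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not taken from this release, and the estimate ignores any per-tensor scaling factors or allocator overhead.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Approximate KV cache footprint of one token across all layers."""
    # Factor of 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

fp16_bytes = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_bytes = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)

print(fp16_bytes // 1024, "KiB per token at 16-bit")  # 128 KiB
print(fp8_bytes // 1024, "KiB per token at FP8")      # 64 KiB
```

Under these assumptions, halving the element size halves the cache footprint, so roughly twice as many tokens fit in the same memory budget.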

⚠️ Breaking Changes

OpenVINO support has been removed from the core repository. It is now available only through the external OpenVINO plugin.
Tensor parallelism (TP) greater than 1 is no longer supported for Mamba2 models when quantization is enabled.

Migration Steps

If using OpenVINO, migrate to the external OpenVINO plugin.
Update PyTorch to 2.6.0 if running on macOS or ROCm (AMD).
If using Mamba2 with quantization, ensure Tensor Parallelism (TP) is set to 1.
Optionally enable the new fastsafetensors loader for improved weight loading performance.
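
The migration steps above can be sketched as a small pre-upgrade check. `DeploymentPlan` and `migration_issues` are hypothetical helpers written for this illustration. They are not part of vLLM, and the platform labels are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DeploymentPlan:
    """Hypothetical description of a deployment; not a vLLM type."""
    platform: str                  # assumed labels: "cuda", "rocm", "macos", ...
    torch_version: str             # e.g. "2.5.1" or "2.6.0+rocm6.2"
    uses_openvino: bool = False
    is_mamba2: bool = False
    quantized: bool = False
    tensor_parallel_size: int = 1


def migration_issues(plan):
    """Return the migration steps from these notes that apply to the plan."""
    issues = []
    if plan.uses_openvino:
        issues.append("OpenVINO was removed from core; use the external plugin.")
    if plan.platform in {"macos", "rocm"} and not plan.torch_version.startswith("2.6.0"):
        issues.append("Update PyTorch to 2.6.0 on macOS or ROCm.")
    if plan.is_mamba2 and plan.quantized and plan.tensor_parallel_size > 1:
        issues.append("Set tensor parallelism to 1 for quantized Mamba2 models.")
    return issues


plan = DeploymentPlan(
    platform="rocm",
    torch_version="2.5.1",
    is_mamba2=True,
    quantized=True,
    tensor_parallel_size=2,
)
for issue in migration_issues(plan):
    print(issue)
```

The optional fastsafetensors step is deliberately left out of the check, since it is a performance choice and not a requirement.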

✨ New Features

Integrated fastsafetensors loader for faster model weight loading.
Added guidance backend for structured output with 'auto' fallback mode.
Support for FP8 KV Cache in V1 Engine.
Support for tool calling and reasoning parser.
Added pipeline parallel support to TransformersModel.
Support for Tele-FLM Model.
Enabled spec decode for top-p and top-k sampling.
Added support for the Triton (ROCm) attention backend on Nvidia GPUs.
Support for MHA Pallas backend and Tensor parallel MP on TPU.
Added --disable-uvicorn-access-log parameter.
Added disable-any-whitespace option for xgrammar structured output.
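
As a rough illustration of opting in to several of these features at launch, the sketch below assembles a `vllm serve` command line. Only `--disable-uvicorn-access-log` is spelled out in these notes. The `--load-format fastsafetensors` and `--kv-cache-dtype fp8` spellings are assumptions to verify against `vllm serve --help`, and `build_serve_command` and the model name are hypothetical.

```python
import shlex


def build_serve_command(model, *, fast_loader=False, fp8_kv_cache=False,
                        quiet_access_log=False):
    """Assemble a shell-safe `vllm serve` command line (illustrative only)."""
    argv = ["vllm", "serve", model]
    if fast_loader:
        # Assumed flag and value for the fastsafetensors loader.
        argv += ["--load-format", "fastsafetensors"]
    if fp8_kv_cache:
        # Assumed flag and value for the FP8 KV cache.
        argv += ["--kv-cache-dtype", "fp8"]
    if quiet_access_log:
        # Flag name as given in these release notes.
        argv.append("--disable-uvicorn-access-log")
    return shlex.join(argv)


# "my-org/my-model" is a placeholder model name.
print(build_serve_command("my-org/my-model", fast_loader=True, quiet_access_log=True))
```

Building the argument list first and joining it with `shlex.join` keeps model names with unusual characters safely quoted.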

🐛 Bug Fixes

Fixed V1 Engine crash when handling requests with duplicate request IDs.
Fixed memory usage issues in V1 engine.
Fixed embedding assignment for InternVL-based models.
Fixed incorrect attention mask pre-computation for Qwen2.5-VL.
Fixed multi-video inference on LLaVA-Onevision.
Fixed CUDA kernel index data type in fused_kernels/layernorm_utils.cuh.
Fixed issue where 'Total generated tokens' reported 0 when using TGI backend.
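
These notes do not describe how the duplicate request ID crash was fixed. The sketch below only illustrates the general idea of rejecting a duplicate with an explicit error instead of failing later. `RequestRegistry` is a hypothetical class, not vLLM's implementation.

```python
class RequestRegistry:
    """Hypothetical tracker of in-flight request IDs; not vLLM code."""

    def __init__(self):
        self._in_flight = set()

    def add(self, request_id):
        # Reject a duplicate up front with a clear, recoverable error.
        if request_id in self._in_flight:
            raise ValueError(f"duplicate request ID: {request_id!r}")
        self._in_flight.add(request_id)

    def finish(self, request_id):
        # Once a request completes, its ID may be reused.
        self._in_flight.discard(request_id)


registry = RequestRegistry()
registry.add("req-1")
try:
    registry.add("req-1")
except ValueError as err:
    print(err)
```

Callers can avoid the situation entirely by generating IDs with something like `uuid.uuid4()`.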

🔧 Affected Symbols

V1 Engine, TransformersModel, Scheduler, xgrammar, guidance, Mamba2, InternVL, LLaVA-Onevision, Qwen2.5-VL, Tele-FLM, RayExecutor

⚡ Deprecations

OpenVINO support in core (replaced by external plugin).