vllm v0.8.2
⚠ 2 breaking · ✨ 11 features · 🐛 7 fixes · ⚡ 1 deprecation · 🔧 11 symbols
Summary
This release focuses on V1 engine stability, including a critical memory fix and crash prevention for duplicate request IDs. It introduces FP8 KV Cache support, a new fastsafetensors loader, and expands hardware support for TPU and ROCm.
⚠️ Breaking Changes
- OpenVINO support has been removed from the core repository. Users must now use the external plugin for OpenVINO support.
- Tensor parallelism (TP) greater than 1 is no longer supported for Mamba2 models when quantization is enabled.
Migration Steps
- If using OpenVINO, migrate to the external OpenVINO plugin.
- Update PyTorch to 2.6.0 if running on macOS or ROCm (AMD).
- If using Mamba2 with quantization, ensure Tensor Parallelism (TP) is set to 1.
- Optionally enable the new fastsafetensors loader for improved weight loading performance.
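The required steps above can be expressed as a small pre-upgrade check. This is a minimal sketch, not part of vLLM: `LaunchConfig`, `migration_warnings`, and the platform and model-family labels are hypothetical names invented for illustration, and the rules simply restate the migration steps listed in these notes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LaunchConfig:
    """Hypothetical description of a deployment; not a vLLM class."""
    model_family: str                 # e.g. "mamba2" (label chosen for this sketch)
    quantization: Optional[str]       # None when the model is not quantized
    tensor_parallel_size: int = 1
    uses_openvino: bool = False
    platform: str = "cuda"            # "cuda", "rocm", "macos" (sketch labels)
    torch_version: str = "2.6.0"


def migration_warnings(cfg: LaunchConfig) -> List[str]:
    """Return the v0.8.2 migration steps that apply to this configuration."""
    warnings = []
    if cfg.uses_openvino:
        warnings.append("OpenVINO was removed from core: switch to the external plugin.")
    # Local build suffixes such as "2.6.0+rocm6.2" still count as 2.6.0 here.
    if cfg.platform in ("rocm", "macos") and not cfg.torch_version.startswith("2.6.0"):
        warnings.append("Update PyTorch to 2.6.0 on macOS or ROCm.")
    if (cfg.model_family == "mamba2"
            and cfg.quantization is not None
            and cfg.tensor_parallel_size > 1):
        warnings.append("Quantized Mamba2 requires tensor parallelism of 1.")
    return warnings


if __name__ == "__main__":
    cfg = LaunchConfig(model_family="mamba2", quantization="fp8", tensor_parallel_size=2)
    for line in migration_warnings(cfg):
        print(line)
```

Returning a list of messages rather than raising lets a deployment script report every applicable step in one pass.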
✨ New Features
- Integrated fastsafetensors loader for faster model weight loading.
- Added guidance backend for structured output with 'auto' fallback mode.
- Support for FP8 KV Cache in V1 Engine.
- Support for tool calling and reasoning parser.
- Added pipeline parallel support to TransformersModel.
- Support for Tele-FLM Model.
- Enabled spec decode for top-p and top-k sampling.
- Added Triton (ROCm) attention backend support for NVIDIA GPUs.
- Support for MHA Pallas backend and Tensor parallel MP on TPU.
- Added --disable-uvicorn-access-log parameter.
- Added disable-any-whitespace option for xgrammar structured output.
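Several of these features are opt-in at server launch. The sketch below assembles a `vllm serve` command line in Python without importing vLLM. Only `--disable-uvicorn-access-log` is named in these notes; the flag spellings `--load-format fastsafetensors` and `--kv-cache-dtype fp8` are assumptions to confirm against `vllm serve --help` for the installed version, and `build_serve_command` and the model name are hypothetical.

```python
import shlex
from typing import List


def build_serve_command(model: str,
                        fast_loader: bool = False,
                        fp8_kv_cache: bool = False,
                        quiet_access_log: bool = False) -> List[str]:
    """Build an argument list for `vllm serve` (hypothetical helper)."""
    cmd = ["vllm", "serve", model]
    if fast_loader:
        # Assumed spelling for selecting the fastsafetensors loader.
        cmd += ["--load-format", "fastsafetensors"]
    if fp8_kv_cache:
        # Assumed spelling for requesting an FP8 KV cache.
        cmd += ["--kv-cache-dtype", "fp8"]
    if quiet_access_log:
        # Flag named in the release notes above.
        cmd.append("--disable-uvicorn-access-log")
    return cmd


if __name__ == "__main__":
    args = build_serve_command("my-org/my-model", fast_loader=True, quiet_access_log=True)
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(args))
```

Building the command as a list keeps each argument separate, so the result can be passed directly to `subprocess.run` without shell quoting concerns.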
🐛 Bug Fixes
- Fixed V1 Engine crash when handling requests with duplicate request IDs.
- Fixed memory usage issues in V1 engine.
- Fixed embedding assignment for InternVL-based models.
- Fixed incorrect qwen2.5-vl attention mask pre-computation.
- Fixed multi-video inference on LLaVA-Onevision.
- Fixed CUDA kernel index data type in fused_kernels/layernorm_utils.cuh.
- Fixed issue where 'Total generated tokens' reported 0 when using TGI backend.
🔧 Affected Symbols
- V1 Engine
- TransformersModel
- Scheduler
- xgrammar
- guidance
- Mamba2
- InternVL
- LLaVA-Onevision
- Qwen2.5-VL
- Tele-FLM
- RayExecutor
⚡ Deprecations
- OpenVINO support in core (replaced by external plugin).