v0.8.2

📦 vllm
2 breaking · 11 features · 🐛 7 fixes · 1 deprecation · 🔧 11 symbols

Summary

This release focuses on V1 engine stability, including a critical memory fix and crash prevention for duplicate request IDs. It introduces FP8 KV Cache support, a new fastsafetensors loader, and expands hardware support for TPU and ROCm.

⚠️ Breaking Changes

  • OpenVINO support has been removed from the core repository. Users must now use the external plugin for OpenVINO support.
  • Tensor parallelism (TP) greater than 1 is no longer supported for Mamba2 models when quantization is enabled.

Migration Steps

  1. If using OpenVINO, migrate to the external OpenVINO plugin.
  2. Update PyTorch to 2.6.0 if running on macOS or ROCm (AMD).
  3. If using Mamba2 with quantization, ensure Tensor Parallelism (TP) is set to 1.
  4. Optionally enable the new fastsafetensors loader for improved weight loading performance.
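
The migration steps above can be sketched as a small pre-upgrade check. This is an illustrative helper only: `DeploymentConfig`, `parse_version`, and `migration_warnings` are hypothetical names and not part of vLLM, and treating "update to 2.6.0" as "anything older than 2.6.0 needs updating" is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DeploymentConfig:
    """Hypothetical description of a deployment; not a vLLM class."""
    model_family: str                  # e.g. "mamba2" or "llama"
    quantization: Optional[str]        # e.g. "fp8", or None when unquantized
    tensor_parallel_size: int = 1
    platform: str = "cuda"             # "cuda", "rocm", "macos", ...
    torch_version: str = "2.6.0"
    uses_openvino: bool = False


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from strings such as '2.6.0+rocm6.2'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch))


def migration_warnings(cfg: DeploymentConfig) -> List[str]:
    """Return one message per migration step that applies to cfg."""
    warnings: List[str] = []
    # Step 1: OpenVINO now lives in an external plugin.
    if cfg.uses_openvino:
        warnings.append("OpenVINO was removed from core; use the external plugin.")
    # Step 2: macOS and ROCm need PyTorch 2.6.0 (assumed: older versions must update).
    if cfg.platform in {"macos", "rocm"} and parse_version(cfg.torch_version) < (2, 6, 0):
        warnings.append("Update PyTorch to 2.6.0 on macOS or ROCm.")
    # Step 3: quantized Mamba2 models no longer support TP > 1.
    if (
        cfg.model_family == "mamba2"
        and cfg.quantization is not None
        and cfg.tensor_parallel_size > 1
    ):
        warnings.append("Set tensor parallelism (TP) to 1 for quantized Mamba2.")
    return warnings


if __name__ == "__main__":
    cfg = DeploymentConfig(
        model_family="mamba2",
        quantization="fp8",
        tensor_parallel_size=2,
        platform="rocm",
        torch_version="2.5.1+rocm6.2",
    )
    for message in migration_warnings(cfg):
        print(message)
```

Step 4 is optional and therefore produces no warning here.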

✨ New Features

  • Integrated fastsafetensors loader for faster model weight loading.
  • Added guidance backend for structured output with 'auto' fallback mode.
  • Support for FP8 KV Cache in V1 Engine.
  • Support for tool calling and reasoning parser.
  • Added pipeline parallel support to TransformersModel.
  • Support for Tele-FLM Model.
  • Enabled spec decode for top-p and top-k sampling.
  • Added support for the Triton (ROCm) attention backend on NVIDIA GPUs.
  • Support for MHA Pallas backend and Tensor parallel MP on TPU.
  • Added --disable-uvicorn-access-log parameter.
  • Added disable-any-whitespace option for xgrammar structured output.
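
As a rough illustration of opting in to several of the features above, the sketch below assembles a `vllm serve` command line. Only `--disable-uvicorn-access-log` is named in these notes. The spellings `--load-format fastsafetensors`, `--kv-cache-dtype fp8`, and `--guided-decoding-backend guidance` are assumptions that should be checked against `vllm serve --help` for the installed version. `build_serve_command` and the model name are hypothetical.

```python
import shlex
from typing import List


def build_serve_command(
    model: str,
    fast_loader: bool = False,
    fp8_kv_cache: bool = False,
    guidance_backend: bool = False,
    quiet_access_log: bool = False,
) -> List[str]:
    """Build an argument list for `vllm serve` (hypothetical helper)."""
    cmd = ["vllm", "serve", model]
    if fast_loader:
        # Assumed flag spelling for the fastsafetensors loader.
        cmd += ["--load-format", "fastsafetensors"]
    if fp8_kv_cache:
        # Assumed flag spelling for the FP8 KV cache.
        cmd += ["--kv-cache-dtype", "fp8"]
    if guidance_backend:
        # Assumed flag spelling for the guidance structured-output backend.
        cmd += ["--guided-decoding-backend", "guidance"]
    if quiet_access_log:
        # Flag named in the release notes.
        cmd.append("--disable-uvicorn-access-log")
    return cmd


if __name__ == "__main__":
    command = build_serve_command(
        "my-org/my-model",  # placeholder model name
        fast_loader=True,
        quiet_access_log=True,
    )
    # Print a shell-safe string rather than executing anything.
    print(shlex.join(command))
```

Building the command as a list and printing it keeps the sketch free of side effects; the list can be passed to `subprocess.run` once the flags have been verified.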

🐛 Bug Fixes

  • Fixed V1 Engine crash when handling requests with duplicate request IDs.
  • Fixed memory usage issues in the V1 Engine.
  • Fixed embedding assignment for InternVL-based models.
  • Fixed incorrect Qwen2.5-VL attention mask pre-computation.
  • Fixed multi-video inference on LLaVA-Onevision.
  • Fixed CUDA kernel index data type in fused_kernels/layernorm_utils.cuh.
  • Fixed issue where 'Total generated tokens' reported 0 when using TGI backend.

🔧 Affected Symbols

V1 Engine, TransformersModel, Scheduler, xgrammar, guidance, Mamba2, InternVL, LLaVA-Onevision, Qwen2.5-VL, Tele-FLM, RayExecutor

⚡ Deprecations

  • OpenVINO support in core (replaced by external plugin).