
Migrating to vLLM v0.17.0

Version v0.17.0 introduces 2 breaking changes. This guide details how to update your code.

Released: 3/7/2026

2 Breaking Changes · 4 Migration Steps · 18 Affected Symbols

⚠️ Check Your Code

If you use any of these symbols, you need to read this guide:

Model Runner V2, Eagle3ModelState, Qwen3.5, DeepSeek-VL V2, Qwen3/Qwen3.5 reasoning parser, Qwen2.5-Omni/Qwen3-Omni, Ernie4.5-VL, Qwen-VL tokenizer, Qwen-Omni audio cache, Nemotron-3-Nano, allreduce_rms_fusion, DCP, FA3, Mamba, num_active_loras, async TP reduce-scatter, NIXL

Breaking Changes

Issue #1

vLLM now requires PyTorch 2.10.0, a breaking change for environment dependencies.
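As a quick preflight before upgrading, a check along these lines can confirm whether an installed PyTorch meets the new 2.10.0 floor. This is a minimal sketch using manual version parsing; production code should prefer `packaging.version` for full PEP 440 handling:

```python
def meets_requirement(installed: str, required: str = "2.10.0") -> bool:
    """Return True if a dotted version string meets the required minimum.

    Strips local-version suffixes like "+cu129" before comparing the
    first three numeric components.
    """
    to_tuple = lambda v: tuple(int(x) for x in v.split("+")[0].split(".")[:3])
    return to_tuple(installed) >= to_tuple(required)

# For a live check: meets_requirement(torch.__version__)
print(meets_requirement("2.9.1"))        # False: below the new floor
print(meets_requirement("2.10.0+cu129")) # True: local suffix is ignored
```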

Issue #2

The KV load failure policy default changed from "recompute" to "fail" for large-scale serving.
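To make the behavioral difference concrete, here is a toy sketch of the two policies; this models the idea only and is not the vLLM API (`load_kv`, the block names, and the return values are all illustrative):

```python
def load_kv(block_id: str, store: dict, policy: str = "fail") -> str:
    """Toy model of fetching a KV cache block from remote storage.

    Under "recompute", a missing block falls back to recomputing it
    (previous default); under "fail", the request errors out (new default).
    """
    if block_id in store:
        return store[block_id]
    if policy == "recompute":
        return f"recomputed-{block_id}"  # silently fall back to prefill
    raise RuntimeError(f"KV load failed for block {block_id}")

store = {"b0": "cached-b0"}
print(load_kv("b0", store))                      # cached-b0
print(load_kv("b1", store, policy="recompute"))  # recomputed-b1
# load_kv("b1", store) would now raise under the new "fail" default
```

The new default surfaces load failures immediately instead of masking them behind extra recompute work, which is easier to monitor at scale but requires callers to handle the error.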

Migration Steps

  1. If encountering `CUBLAS_STATUS_INVALID_VALUE` on CUDA 12.9+, remove the path to system CUDA shared library files from `LD_LIBRARY_PATH` (e.g., `unset LD_LIBRARY_PATH`).
  2. Alternatively, install vLLM using `uv pip install vllm --torch-backend=auto` or `pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129` to resolve the CUDA library mismatch.
  3. Be aware that the KV load failure policy default is now "fail" instead of "recompute"; adjust configurations if necessary for large-scale serving.
  4. If using AMD ROCm, note that the `aiter` package has been renamed to `amd-aiter`.
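Taken together, the environment-related steps above look like this in a shell session. The command forms come from the steps themselves; adjust the `cu129` index URL if you target a different CUDA version:

```shell
# Step 1: avoid CUBLAS_STATUS_INVALID_VALUE caused by a system CUDA
# shadowing PyTorch's bundled libraries on CUDA 12.9+
unset LD_LIBRARY_PATH

# Step 2, option A: let uv select the matching PyTorch CUDA backend
uv pip install vllm --torch-backend=auto

# Step 2, option B: plain pip against the cu129 PyTorch wheel index
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129
```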

Release Summary

vLLM v0.17.0 introduces a major upgrade to PyTorch 2.10, integrates FlashAttention 4, and significantly matures Model Runner V2 with features like Pipeline Parallelism. This release also adds full support for the Qwen3.5 model family and introduces new performance tuning flags.

Need More Details?

View the full release notes and all changes for vLLM v0.17.0.
