v0.17.0
📦 vllm — ⚠ 2 breaking · ✨ 18 features · 🐛 11 fixes · 🔧 18 symbols
Summary
vLLM v0.17.0 introduces a major upgrade to PyTorch 2.10, integrates FlashAttention 4, and significantly matures Model Runner V2 with features like Pipeline Parallelism. This release also adds full support for the Qwen3.5 model family and introduces new performance tuning flags.
⚠️ Breaking Changes
- Upgraded to PyTorch 2.10.0, which is a breaking change for environment dependencies.
- KV load failure policy default changed from "recompute" to "fail" for large-scale serving.
Migration Steps
- If encountering `CUBLAS_STATUS_INVALID_VALUE` on CUDA 12.9+, remove the path to system CUDA shared library files from `LD_LIBRARY_PATH` (e.g., `unset LD_LIBRARY_PATH`).
- Alternatively, install vLLM with `uv pip install vllm --torch-backend=auto` or `pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129` to resolve the CUDA library mismatch.
- Be aware that the KV load failure policy default is now "fail" instead of "recompute"; adjust configurations for large-scale serving if the previous "recompute" behavior is required.
- If using AMD ROCm, note that the `aiter` package has been renamed to `amd-aiter`.
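Taken together, the CUDA-mismatch steps above amount to a short shell sketch (assuming a bash-like shell; the reinstall commands are left as comments because they require network access):

```shell
# Sketch of the CUDA 12.9+ migration for PyTorch 2.10.
# Drop system CUDA shared libraries from the loader path so the
# CUDA libraries bundled with the PyTorch wheel are picked up instead.
unset LD_LIBRARY_PATH

# Then reinstall vLLM against a matching CUDA wheel index, using either
# of the commands from the migration steps above:
#   uv pip install vllm --torch-backend=auto
#   pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129
echo "LD_LIBRARY_PATH is now: '${LD_LIBRARY_PATH:-}'"
```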
✨ New Features
- Support for the FlashAttention 4 backend.
- Model Runner V2 maturation including Pipeline Parallel, Decode Context Parallel, Eagle3 speculative decoding with CUDA graphs, pooling model support, piecewise & mixed CUDA graph capture, DP+EP for spec decoding, and a new ModelState architecture.
- Full support for the Qwen3.5 model family, including GDN, FP8 quantization, MTP speculative decoding, and reasoning parser support.
- New `--performance-mode {balanced, interactivity, throughput}` flag for simplified performance tuning.
- Support for Anthropic thinking blocks, `count_tokens` API, and `tool_choice=none`.
- Weight offloading V2 now hides onloading latency via prefetching, supports selective CPU weight offloading, and performs CPU offloading without doubling pinned memory.
- Initial support for elastic expert parallelism enabling dynamic GPU scaling for MoE models.
- Ability to load quantized LoRA adapters (e.g. QLoRA) directly.
- Extensive compatibility work for HuggingFace Transformers v5.
- Integration of FlashInfer Sparse MLA backend.
- Triton-based top-k and top-p sampler kernels.
- Helion kernel framework integration with autotuning infrastructure.
- Support for new model architectures including Qwen3.5, COLQwen3, ColModernVBERT, Ring 2.5, skt/A.X-K1, Ovis 2.6, and several NVIDIA Nemotron variants.
- Support for ASR models: FunASR, FireRedASR2, Qwen3-ASR realtime streaming.
- Multimodal support enhancements: OpenPangu-VL video input, audio chunking, Parakeet audio encoder, MiniCPM-o flagos.
- Performance improvements across NVIDIA (SM100/SM120 optimizations, DeepGEMM swapAB), AMD ROCm (AITER support, MXFP4 MoE pre-shuffling), Intel XPU (CUDA graph support), and CPU (ARM BF16 cross-compilation, s390x intrinsics).
- Pipeline Parallel async send/recv, yielding a 2.9% throughput improvement.
- NIXL: Token-based IPC API and NUMA core binding support.
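As an illustrative sketch of the new tuning flag, a server launch might look like the command below. The model name is a placeholder (not from the release notes); only `--performance-mode` and its three values (`balanced`, `interactivity`, `throughput`) come from this release:

```shell
# Build the launch command as a string rather than executing it, since
# starting a vLLM server requires a GPU and a downloaded model.
# "Qwen/Qwen3-8B" is a placeholder model name for illustration only.
LAUNCH="vllm serve Qwen/Qwen3-8B --performance-mode throughput"
echo "$LAUNCH"
```

Per the flag's description, `throughput` favors batch efficiency, `interactivity` favors per-request latency, and `balanced` sits in between.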
🐛 Bug Fixes
- Fixes for Qwen3/Qwen3.5 reasoning parser.
- Fixes for Qwen2.5-Omni/Qwen3-Omni mixed-modality issues.
- Fix for Ernie4.5-VL garbled output.
- Fixes for Qwen-VL tokenizer and Qwen-Omni audio cache.
- Fix for Nemotron-3-Nano NVFP4 accuracy with TP > 1.
- Fix for allreduce_rms_fusion being enabled by default with PP > 1.
- Fix for DCP + FA3 crash.
- Fix for prefix caching for Mamba "all" mode.
- Fix for num_active_loras.
- Fix for async TP reduce-scatter reduction.
- Fix for cross-node data parallelism message queue.