v0.7.2

📦 vllm

1 breaking change · 10 new features · 8 bug fixes · 2 deprecations · 9 affected symbols

Summary

This release adds support for Qwen2.5-VL and a new transformers backend for running arbitrary text models. It also significantly improves DeepSeek model performance through KV cache memory alignment and torch.compile optimizations.

⚠️ Breaking Changes

  • Qwen2.5-VL support currently requires a source installation of the Hugging Face transformers library rather than the stable release.

Migration Steps

  1. Install Hugging Face transformers from source to use Qwen2.5-VL.
  2. Set the VLLM_LOGITS_PROCESSOR_THREADS environment variable to speed up structured decoding at high batch sizes.
  3. Update compressed-tensors dependency to the latest version.
  4. Use --model-impl=transformers if you need to run text models not natively supported by vLLM (see the sketch after these steps).

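A minimal sketch of these steps from Python, assuming a pip-based environment; the model name and thread count below are illustrative placeholders, not values from this release:

```python
# Prerequisite (shell), for Qwen2.5-VL: install transformers from source, e.g.
#   pip install git+https://github.com/huggingface/transformers.git
import os

# Number of threads used to run logits processors; this release adds the
# variable to speed up structured decoding at high batch sizes (value here
# is illustrative; tune it for your workload).
os.environ["VLLM_LOGITS_PROCESSOR_THREADS"] = "8"

from vllm import LLM, SamplingParams

# Fall back to the transformers backend for a text model that vLLM does not
# implement natively ("my-org/custom-text-model" is a placeholder).
llm = LLM(model="my-org/custom-text-model", model_impl="transformers")
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```
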
✨ New Features

  • Added Qwen2.5-VL support (usage sketch after this list).
  • Added transformers backend support via --model-impl=transformers for arbitrary text models.
  • Added VLLM_LOGITS_PROCESSOR_THREADS to speed up structured decoding.
  • Enabled MLA (Multi-head Latent Attention) for DeepSeek VL2.
  • Enabled DeepSeek model support on ROCm (AMD).
  • Added XPU bf16 support for Intel GPUs.
  • Added BNB (BitsAndBytes) quantization for Whisper models.
  • Added support for Sparse24Bitmask Compressed Models.
  • Enabled FusedSDPA support for Intel Gaudi (HPU).
  • Added request_success_total counter metric in V1 engine.

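A hedged usage sketch for the new Qwen2.5-VL support, assuming transformers is installed from source as noted above; the model id, image path, and prompt template are illustrative and may need adjusting for your setup:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Qwen2.5-VL served through vLLM (requires transformers built from source).
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", max_model_len=4096)

# One image plus a text prompt; the vision placeholder tokens follow the
# Qwen2-VL chat convention and may differ for your chat template.
image = Image.open("example.jpg")
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```
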
🐛 Bug Fixes

  • Improved hash collision avoidance in prefix caching.
  • Fixed moe_wna16 get_quant_method and attention layer quantization issues.
  • Fixed per-token/per-channel quantization for Hopper scaled mm.
  • Fixed loading of fine-tuned models based on Phi-3-Small.
  • Fixed OpenVINO model runner.
  • Fixed ModuleNotFoundError for intel_extension_for_pytorch when tensor-parallel-size > 1.
  • Fixed CI failures for InternVL and Mantis models.
  • Fixed GLM fused module mappings for quantization.

🔧 Affected Symbols

fused_moe, grouped_topk, torch.compile, TransformersModel, VLLM_LOGITS_PROCESSOR_THREADS, Attention.forward, FinishReason, ConstantList, pynvml

⚡ Deprecations

  • Discord community (replaced by Developer Slack in documentation).
  • V1 uncache_blocks (reverted in favor of recaching full blocks).