v0.7.2

📦 vllm

1 breaking change · 10 new features · 8 bug fixes · 2 deprecations · 9 affected symbols

Summary

This release adds support for Qwen2.5-VL and a new transformers backend for running arbitrary text models. It also significantly improves DeepSeek model performance through KV cache memory alignment and torch.compile optimizations.

⚠️ Breaking Changes

  • Qwen2.5-VL support currently requires a source installation of the Hugging Face transformers library rather than the stable release.

Migration Steps

  1. Install Hugging Face transformers from source to use Qwen2.5-VL.
  2. Set the VLLM_LOGITS_PROCESSOR_THREADS environment variable to speed up structured decoding at high batch sizes.
  3. Update compressed-tensors dependency to the latest version.
  4. Use --model-impl=transformers if you need to run text models not natively supported by vLLM (see the sketch after these steps).

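A minimal sketch of these steps from Python, assuming a pip-based environment; the model name and thread count below are illustrative placeholders, not values from this release:

```python
# Prerequisite (shell), for Qwen2.5-VL: install transformers from source, e.g.
#   pip install git+https://github.com/huggingface/transformers.git
import os

# Number of threads used to run logits processors; this release adds the
# variable to speed up structured decoding at high batch sizes (value here
# is illustrative; tune it for your workload).
os.environ["VLLM_LOGITS_PROCESSOR_THREADS"] = "8"

from vllm import LLM, SamplingParams

# Fall back to the transformers backend for a text model that vLLM does not
# implement natively ("my-org/custom-text-model" is a placeholder).
llm = LLM(model="my-org/custom-text-model", model_impl="transformers")
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```
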
✨ New Features

  • Added Qwen2.5-VL support (usage sketch after this list).
  • Added transformers backend support via --model-impl=transformers for arbitrary text models.
  • Added VLLM_LOGITS_PROCESSOR_THREADS to speed up structured decoding.
  • Enabled MLA (Multi-head Latent Attention) for DeepSeek VL2.
  • Enabled DeepSeek model support on ROCm (AMD).
  • Added XPU bf16 support for Intel GPUs.
  • Added BNB (BitsAndBytes) quantization for Whisper models.
  • Added support for Sparse24Bitmask Compressed Models.
  • Enabled FusedSDPA support for Intel Gaudi (HPU).
  • Added request_success_total counter metric in V1 engine.

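A hedged usage sketch for the new Qwen2.5-VL support, assuming transformers is installed from source as noted above; the model id, image path, and prompt template are illustrative and may need adjusting for your setup:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Qwen2.5-VL served through vLLM (requires transformers built from source).
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", max_model_len=4096)

# One image plus a text prompt; the vision placeholder tokens follow the
# Qwen2-VL chat convention and may differ for your chat template.
image = Image.open("example.jpg")
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```
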
🐛 Bug Fixes

  • Improved hash collision avoidance in prefix caching.
  • Fixed moe_wna16 get_quant_method and attention layer quantization issues.
  • Fixed per-token/per-channel quantization for Hopper scaled mm.
  • Fixed loading of fine-tuned models based on Phi-3-Small.
  • Fixed OpenVINO model runner.
  • Fixed ModuleNotFoundError for intel_extension_for_pytorch when tensor-parallel-size > 1.
  • Fixed CI failures for InternVL and Mantis models.
  • Fixed GLM fused module mappings for quantization.

🔧 Affected Symbols

fused_moe, grouped_topk, torch.compile, TransformersModel, VLLM_LOGITS_PROCESSOR_THREADS, Attention.forward, FinishReason, ConstantList, pynvml

⚡ Deprecations

  • Discord community (replaced by Developer Slack in documentation).
  • V1 uncache_blocks (reverted in favor of recaching full blocks).