v0.7.2
📦 vllm
⚠ 1 breaking · ✨ 10 features · 🐛 8 fixes · ⚡ 2 deprecations · 🔧 9 symbols
Summary
This release introduces support for Qwen2.5-VL and a new transformers backend (--model-impl=transformers) for running arbitrary text models. It also significantly improves DeepSeek model performance through KV cache memory alignment and torch.compile optimizations.
⚠️ Breaking Changes
- Qwen2.5-VL support currently requires a source installation of the Hugging Face transformers library rather than the stable release.
Migration Steps
- Install Hugging Face transformers from source to use Qwen2.5-VL.
- Set the VLLM_LOGITS_PROCESSOR_THREADS environment variable to speed up structured decoding at high batch sizes.
- Update compressed-tensors dependency to the latest version.
- Use --model-impl=transformers to run text models not natively optimized by vLLM (see the combined sketch below).
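
Taken together, the migration looks roughly like the following. This is a minimal sketch: the thread count and the model name are illustrative placeholders, not values prescribed by the release notes.

```bash
# Install Hugging Face transformers from source (required for Qwen2.5-VL)
pip install git+https://github.com/huggingface/transformers

# Update compressed-tensors to the latest release
pip install --upgrade compressed-tensors

# Speed up structured decoding at high batch sizes
# (8 is an illustrative value; tune for your workload)
export VLLM_LOGITS_PROCESSOR_THREADS=8

# Run a text model not natively optimized by vLLM via the transformers backend
# (model name is a hypothetical placeholder)
vllm serve my-org/my-text-model --model-impl=transformers
```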
✨ New Features
- Added Qwen2.5-VL support (launch sketch after this list).
- Added transformers backend support via --model-impl=transformers for arbitrary text models.
- Added VLLM_LOGITS_PROCESSOR_THREADS to speed up structured decoding.
- Enabled MLA (Multi-head Latent Attention) for DeepSeek VL2.
- Enabled DeepSeek model support on ROCm (AMD).
- Added XPU bf16 support for Intel GPUs.
- Added BNB (BitsAndBytes) quantization for Whisper models.
- Added support for Sparse24Bitmask Compressed Models.
- Enabled FusedSDPA support for Intel Gaudi (HPU).
- Added request_success_total counter metric in V1 engine.
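
A hedged sketch tying a few of these features together: launching the newly supported Qwen2.5-VL model with the structured-decoding thread pool, then checking the new success counter on the Prometheus metrics endpoint. The model ID, thread count, and port are illustrative assumptions, not values taken from the release notes.

```bash
# Launch with the new logits-processor thread pool (value is illustrative);
# Qwen2.5-VL requires transformers installed from source (see Breaking Changes)
VLLM_LOGITS_PROCESSOR_THREADS=8 vllm serve Qwen/Qwen2.5-VL-7B-Instruct

# In another shell: the V1 engine now exports a request success counter
# via the /metrics endpoint (default port 8000 assumed)
curl -s http://localhost:8000/metrics | grep request_success
```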
🐛 Bug Fixes
- Improved hash collision avoidance in prefix caching.
- Fixed moe_wna16 get_quant_method and attention layer quantization issues.
- Fixed per-token/per-channel quantization for Hopper scaled mm.
- Fixed loading of fine-tuned models based on Phi-3-Small.
- Fixed OpenVINO model runner.
- Fixed ModuleNotFoundError for intel_extension_for_pytorch when tensor-parallel-size > 1.
- Fixed CI failures for InternVL and Mantis models.
- Fixed GLM fused module mappings for quantization.
🔧 Affected Symbols
fused_moe, grouped_topk, torch.compile, TransformersModel, VLLM_LOGITS_PROCESSOR_THREADS, Attention.forward, FinishReason, ConstantList, pynvml
⚡ Deprecations
- Discord community (replaced by the Developer Slack in the documentation).
- V1 uncache_blocks (reverted in favor of recaching full blocks).