Changelog

v0.7.3

Breaking Changes
📦 vllm
⚠️ 2 breaking · ✨ 9 features · 🐛 6 fixes · 🔧 10 symbols

Summary

This release introduces significant DeepSeek optimizations including Multi-Token Prediction and MLA FlashAttention3 support, alongside major V1 Engine updates like LoRA and Pipeline Parallelism. It expands hardware support for TPU, ROCm, and Gaudi while adding several new model architectures and quantization methods.

⚠️ Breaking Changes

  • Text-only and vision variants of the same model architecture are now registered as separate models, which may require updating model-loading logic for the affected VLM architectures.
  • The V1 Engine now uses msgpack for core request serialization, potentially breaking custom integrations that relied on the previous serialization format.

Migration Steps

  1. Ensure PyTorch 2.6 or nightly is used for full compatibility with recent torch.compile enhancements.
  2. Update VLM implementations to use the new merged multimodal processors for Mllama, GLM4V, and Molmo.
  3. If using V1 Engine, verify that any custom request handling is compatible with msgpack serialization.
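
To sanity-check step 3, here is a minimal sketch that round-trips a request-like payload through msgpack. The field names are illustrative, not vLLM's actual V1 request schema; the key constraint is that custom payloads must be expressible as plain msgpack types.

```python
# Minimal sketch: confirm a custom payload survives a msgpack round trip.
# Field names are illustrative, not vLLM's actual V1 request schema.
import msgpack

request = {
    "request_id": "req-123",
    "prompt_token_ids": [1, 2, 3],
    "sampling_params": {"temperature": 0.7, "max_tokens": 64},
}

packed = msgpack.packb(request)     # bytes in msgpack format
restored = msgpack.unpackb(packed)  # plain dicts/lists/strs/ints survive intact
assert restored == request
```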

✨ New Features

  • Support for DeepSeek Multi-Token Prediction (MTP), with a 1.69x speedup at low QPS.
  • V1 Engine LoRA support and Pipeline Parallelism support.
  • Initial speculative decoding support with ngrams (see the sketch after this list).
  • Support for Mamba2 (Codestral Mamba), Bamba, and IBM/NASA Prithvi Geospatial models.
  • Added an OpenAI-compatible /v1/audio/transcriptions endpoint (see the example after this list).
  • Support for GPTQModel Dynamic [2,3,4,8]bit and Unsloth Dynamic 4bit BnB quantization.
  • NVIDIA nvfp4 quantization support and AMD Per-Token-Activation Per-Channel-Weight FP8 support.
  • V1 Engine support for TPU and initial ROCm support.
  • Choice-based structured output via xgrammar integration (see the example after this list).
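
For ngram speculative decoding, a minimal offline sketch might look like the following. The flags shown (speculative_model="[ngram]", ngram_prompt_lookup_max) follow the v0-era offline API and may differ under the V1 engine; the model name is a placeholder.

```python
# Sketch: ngram speculative decoding via the offline LLM API.
# Flag names follow the v0-era docs and may differ in the V1 engine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    speculative_model="[ngram]",               # draft tokens via prompt n-gram lookup
    num_speculative_tokens=5,                  # tokens proposed per step
    ngram_prompt_lookup_max=4,                 # max n-gram size to match in the prompt
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```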
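
The new transcription endpoint follows the OpenAI audio API shape, so the standard openai client should work against a vLLM server. A sketch, assuming a Whisper-style model is being served; the server address, model name, and audio file are placeholders.

```python
# Sketch: calling vLLM's OpenAI-compatible transcription endpoint.
# Assumes a vLLM server on localhost:8000 serving a Whisper-style model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # placeholder model name
        file=audio,
    )
print(result.text)
```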
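
For choice-based structured output, vLLM's OpenAI-compatible server accepts a guided_choice list through the client's extra_body passthrough. A minimal sketch, with the server address and model name as placeholders.

```python
# Sketch: constraining output to a fixed set of choices (xgrammar-backed).
# Assumes a vLLM server on localhost:8000; the model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Was this review positive or negative? 'Loved it!'"}],
    extra_body={"guided_choice": ["positive", "negative"]},  # vLLM extension field
)
print(completion.choices[0].message.content)
```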

🐛 Bug Fixes

  • Fixed FlashAttention2 illegal memory access errors.
  • Fixed unsupported FA version check for Turing GPUs.
  • Resolved disaggregated prefill hang caused by communication issues.
  • Fixed multi-round chat errors when using Mistral tokenizers.
  • Fixed missing quant_config in DeepSeek embedding layer.
  • Fixed Qwen2_5_VLForConditionalGeneration packed_modules_mapping.

🔧 Affected Symbols

TransformersModel, Qwen2_5_VLForConditionalGeneration, Mllama, GLM4V, Molmo, Phi3, KVCacheManager, ModelInputForGPU, rotary_embedding, v1.Sampler