
v4.51.3-Qwen2.5-Omni-preview

📦 transformers
✨ 6 features · 🔧 4 symbols

Summary

This release introduces Qwen2.5-Omni, an end-to-end multimodal model capable of perceiving text, images, audio, and video while generating synchronized text and speech responses.

Migration Steps

  1. Install the preview version using: pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview

✨ New Features

  • Added support for Qwen2.5-Omni, a unified multimodal model for text, image, audio, and video.
  • Introduced Qwen2_5OmniForConditionalGeneration for joint text and speech generation.
  • Introduced Qwen2_5OmniThinkerForConditionalGeneration for text-only generation to save compute.
  • Introduced Qwen2_5OmniProcessor for handling multimodal inputs and chat templates.
  • Added support for streaming multimodal inputs via block-wise processing and TMRoPE (Time-aligned Multimodal RoPE).
  • Added support for batched mixed-media inference (text, images, audio, and video in the same batch).
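As a rough sketch of how the new classes listed above are expected to fit together: a conversation is assembled in the processor's chat-template message format, then passed through Qwen2_5OmniProcessor and Qwen2_5OmniForConditionalGeneration. The checkpoint name "Qwen/Qwen2.5-Omni-7B", the exact message schema, and the two-value return of generate() (text token ids plus a waveform) are assumptions based on common transformers conventions, not details stated in these notes.

```python
from typing import Optional


def build_conversation(prompt: str, video_url: Optional[str] = None) -> list:
    """Assemble a single-turn multimodal chat message in the list-of-dicts
    format that chat templates in transformers conventionally expect."""
    content = []
    if video_url is not None:
        content.append({"type": "video", "video": video_url})
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


def generate_text_and_speech(conversation):
    """Heavy path, not executed here: requires the preview install above
    and a download of the model weights."""
    from transformers import (
        Qwen2_5OmniForConditionalGeneration,
        Qwen2_5OmniProcessor,
    )

    # Full Omni model: generates text and a synchronized speech waveform.
    # Swap in Qwen2_5OmniThinkerForConditionalGeneration for text-only
    # generation to save compute, as noted in the feature list.
    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
    )
    processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    text_ids, audio = model.generate(**inputs)
    return processor.batch_decode(text_ids, skip_special_tokens=True), audio


# The message structure itself can be inspected without loading the model;
# the video URL below is an arbitrary placeholder.
conversation = build_conversation(
    "Describe this clip.", video_url="https://example.com/clip.mp4"
)
```

Keeping the model-loading code inside a function mirrors the split the release itself offers: the lightweight message-building step is cheap, while instantiating the full model (or the Thinker-only variant) is deferred until actually needed.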

Affected Symbols