v4.51.3-Qwen2.5-Omni-preview
📦 transformers
Summary
This release introduces Qwen2.5-Omni, an end-to-end multimodal model capable of perceiving text, images, audio, and video while generating synchronized text and speech responses.
Migration Steps
- Install the preview version: `pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview`
✨ New Features
- Added support for Qwen2.5-Omni, a unified multimodal model for text, image, audio, and video.
- Introduced Qwen2_5OmniForConditionalGeneration for joint text and speech generation.
- Introduced Qwen2_5OmniThinkerForConditionalGeneration for text-only generation to save compute.
- Introduced Qwen2_5OmniProcessor for handling multimodal inputs and chat templates.
- Added support for streaming multimodal inputs via block-wise processing and TMRoPE (Time-aligned Multimodal RoPE), which aligns audio and video frames on a shared time axis.
- Added support for batched inference over mixed media, i.e. text, images, audio, and video in the same batch.
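The pieces above fit together roughly as in the following sketch. The checkpoint name `Qwen/Qwen2.5-Omni-7B`, the `apply_chat_template` arguments, and the `(text_ids, audio)` return shape of `generate()` are assumptions based on the transformers multimodal conventions, so verify them against the model card before relying on this. The heavy model calls are kept inside `run_demo()` so the conversation-building helper can be exercised without downloading weights.

```python
def build_conversation(prompt, image_url=None):
    """Build a chat-template conversation mixing text and, optionally, an image."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image", "image": image_url})
    return [{"role": "user", "content": content}]


def run_demo():
    # Import here so the helper above stays usable without transformers installed.
    from transformers import (
        Qwen2_5OmniForConditionalGeneration,
        Qwen2_5OmniProcessor,
    )

    # Assumed checkpoint name; loading it downloads several GB of weights.
    model_id = "Qwen/Qwen2.5-Omni-7B"
    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto"
    )
    processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

    conversation = build_conversation(
        "Describe this image.", "https://example.com/demo.jpg"  # hypothetical URL
    )
    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    # generate() is assumed to return text token ids plus a synthesized waveform.
    text_ids, audio = model.generate(**inputs)
    print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```

For text-only workloads, swapping `Qwen2_5OmniForConditionalGeneration` for `Qwen2_5OmniThinkerForConditionalGeneration` skips the speech decoder and saves compute.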