v4.51.3-Qwen2.5-Omni-preview
📦 transformers
Summary
This release introduces Qwen2.5-Omni, an end-to-end multimodal model capable of perceiving text, images, audio, and video while generating synchronized text and speech responses.
Migration Steps
- Install the preview version: `pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview`
✨ New Features
- Added support for Qwen2.5-Omni, a unified multimodal model for text, image, audio, and video.
- Introduced Qwen2_5OmniForConditionalGeneration for joint text and speech generation.
- Introduced Qwen2_5OmniThinkerForConditionalGeneration for text-only generation to save compute.
- Introduced Qwen2_5OmniProcessor for handling multimodal inputs and chat templates.
- Added support for streaming multimodal inputs via block-wise processing and TMRoPE (Time-aligned Multimodal RoPE), which aligns audio and video frames on a shared time axis.
- Added support for batched inference over mixed media, i.e. text, images, audio, and video in the same batch.
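The pieces above fit together roughly as in the following sketch. The checkpoint name `Qwen/Qwen2.5-Omni-7B`, the `apply_chat_template` arguments, and the `(text_ids, audio)` return shape of `generate()` are assumptions based on the transformers multimodal conventions, so verify them against the model card before relying on this. The heavy model calls are kept inside `run_demo()` so the conversation-building helper can be exercised without downloading weights.

```python
def build_conversation(prompt, image_url=None):
    """Build a chat-template conversation mixing text and, optionally, an image."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image", "image": image_url})
    return [{"role": "user", "content": content}]


def run_demo():
    # Import here so the helper above stays usable without transformers installed.
    from transformers import (
        Qwen2_5OmniForConditionalGeneration,
        Qwen2_5OmniProcessor,
    )

    # Assumed checkpoint name; loading it downloads several GB of weights.
    model_id = "Qwen/Qwen2.5-Omni-7B"
    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto"
    )
    processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

    conversation = build_conversation(
        "Describe this image.", "https://example.com/demo.jpg"  # hypothetical URL
    )
    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    # generate() is assumed to return text token ids plus a synthesized waveform.
    text_ids, audio = model.generate(**inputs)
    print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```

For text-only workloads, swapping `Qwen2_5OmniForConditionalGeneration` for `Qwen2_5OmniThinkerForConditionalGeneration` skips the speech decoder and saves compute.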