
v4.52.1

📦 transformers

Summary

This release introduces several major multimodal and specialized models, including the Qwen2.5-Omni streaming model, the high-precision SAM-HQ segmenter, and the D-FINE real-time object detector.

Migration Steps

  1. Review the release notes for the newly introduced models: Qwen2.5-Omni, SAM-HQ, GraniteMoeHybrid, D-FINE, and CSM.
  2. If you intend to use Qwen2.5-Omni, note that it is a unified multimodal model supporting text, image, audio, and video inputs, with streaming text and speech output via its Thinker-Talker architecture (see the first sketch after this list).
  3. If you intend to use SAM-HQ, substitute it for the original SAM model wherever higher-quality segmentation masks are required; it keeps SAM's promptability and zero-shot capability while improving mask detail (see the second sketch after this list).
  4. If you intend to use GraniteMoeHybrid, be aware that its decoder layers use either state space layers or MoE attention layers with shared experts, and that attention layers do not use positional embeddings by default, which may differ from the configuration of previous Granite models (see the third sketch after this list).
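
For step 2, a minimal text-only sketch of loading Qwen2.5-Omni. The `Qwen2_5OmniForConditionalGeneration` / `Qwen2_5OmniProcessor` class names, the `Qwen/Qwen2.5-Omni-7B` checkpoint, and the `return_audio` flag are assumptions; the model documentation in this release is the authoritative reference, including for streaming speech output from the Talker.

```python
# Text-only sketch; class names, checkpoint ID, and `return_audio` are assumptions.
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model_id = "Qwen/Qwen2.5-Omni-7B"  # assumed checkpoint name
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Summarize TMRoPE in one sentence."}]}
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# The Thinker produces text; the Talker can additionally stream speech.
# Only text decoding is requested here.
text_ids = model.generate(**inputs, max_new_tokens=64, return_audio=False)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```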
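
For step 3, a point-prompted segmentation sketch written to mirror the existing SAM flow in transformers. The `SamHQModel` / `SamHQProcessor` class names and the checkpoint ID are assumptions; the prompt and mask post-processing calls follow the original SAM API.

```python
# Drop-in replacement sketch for a point-prompted SAM call; SamHQ* class names
# and the checkpoint ID are assumptions.
import torch
from PIL import Image
from transformers import SamHQModel, SamHQProcessor

checkpoint = "syscv-community/sam-hq-vit-base"  # assumed checkpoint name
processor = SamHQProcessor.from_pretrained(checkpoint)
model = SamHQModel.from_pretrained(checkpoint)

image = Image.open("example.jpg").convert("RGB")
input_points = [[[450, 600]]]  # one (x, y) point prompt, as with the original SAM

inputs = processor(image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Post-process low-resolution mask logits back to the original image size.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
print(masks[0].shape, outputs.iou_scores)
```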
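
For step 4, GraniteMoeHybrid is expected to load through the standard Auto classes like any other causal language model; the checkpoint ID below is an assumption.

```python
# Minimal causal-LM sketch; the checkpoint ID is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-tiny-preview"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("State space layers differ from attention in that", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```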

✨ New Features

  • Added Qwen2.5-Omni: A unified multimodal model for text, image, audio, and video with streaming capabilities and TMRoPE (Time-aligned Multimodal RoPE).
  • Added SAM-HQ (High-Quality Segment Anything Model): Enhances SAM with a High-Quality Output Token and Global-local Feature Fusion for better mask precision.
  • Added GraniteMoeHybrid: A model combining state space layers and MoE attention layers with shared experts.
  • Added D-FINE: A real-time object detector using Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD); see the usage sketch after this list.
  • Added CSM (Conversational Speech Model): An open-source contextual text-to-speech model; see the usage sketch after this list.
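
A quick way to try D-FINE is the generic object-detection pipeline; the checkpoint ID below is an assumption, and any released D-FINE detection checkpoint should slot in the same way.

```python
# Object-detection sketch via the generic pipeline; the checkpoint ID is an assumption.
from transformers import pipeline

detector = pipeline("object-detection", model="ustc-community/dfine-small-coco")  # assumed checkpoint
results = detector("street_scene.jpg", threshold=0.5)
for det in results:
    print(det["label"], round(det["score"], 3), det["box"])
```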
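
A sketch of generating speech with CSM. The `CsmForConditionalGeneration` class, the `sesame/csm-1b` checkpoint, the speaker-ID prompt format, and the `output_audio` / `save_audio` helpers are all assumptions; verify against the CSM model documentation before relying on them.

```python
# Heavily hedged sketch; class name, checkpoint, prompt format, and audio
# helpers are assumptions — consult the CSM model docs for the exact API.
from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "sesame/csm-1b"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# "[0]" is assumed to select speaker 0, following the model card convention.
text = "[0]Release notes are easier to read when they come with examples."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(model.device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example.wav")  # assumed convenience helper
```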

🔧 Affected Symbols

Qwen2.5-Omni, SAM-HQ, GraniteMoeHybrid, GraniteMoeSharedModel, Bamba, D-FINE, CSM