Changelog

v4.56.0

📦 transformers
✨ 12 features · 🐛 1 fix · 🔧 12 symbols

Summary

This release introduces several major vision and multimodal models, including DINOv3, SAM 2, and Ovis2, alongside a significant refactor of the caching system that reduces memory use for sliding window attention.

Migration Steps

  1. Review the release notes for the new model additions: DINOv3, X-Codec, Ovis2, MetaCLIP 2, Florence-2, SAM 2, Kosmos-2.5, HunYuan, Seed-OSS, and GLM-4.5V.
  2. If you wish to use any of the newly added models, consult the Hugging Face Model Hub or the corresponding paper for the correct model identifier and usage instructions (see the loading sketch after these steps).
  3. No breaking changes were identified in this release summary; existing code using previously supported models should continue to function as expected.
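
The newly added models follow the standard Auto-class loading pattern. The sketch below is a minimal example, not taken from the release notes: the checkpoint identifier is hypothetical, so look up the real one on the Hub before running.

```python
from transformers import AutoImageProcessor, AutoModel
from PIL import Image

model_id = "facebook/dinov3-vit-base"  # hypothetical identifier; verify on the Hub

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local image
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # patch-level image features
```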

✨ New Features

  • Add DINOv3 vision foundation model support.
  • Add X-Codec neural audio codec for music continuation and audio tokenization.
  • Add Ovis2 multi-modal large language model and processor.
  • Add MetaCLIP 2 multilingual CLIP replication.
  • Add Florence-2 vision foundation model for prompt-based vision tasks.
  • Add Segment Anything 2 (SAM2) for image and video segmentation.
  • Add Kosmos-2.5 multimodal literate model for text-intensive images.
  • Add HunYuan model support.
  • Add ByteDance Seed-OSS model support.
  • Add GLM-4.5V model support.
  • Refactor the caching system to add DynamicSlidingWindowLayer for memory-efficient sliding window attention (see the cache sketch after this list).
  • Stabilize MXFP4 quantization support, including CPU inference via dequantization and support for older GPUs (sm75 and newer).
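
The caching refactor is transparent to most users, since generate() builds the cache internally. The sketch below is a minimal illustration under that assumption; the checkpoint identifier is illustrative, and the use of DynamicSlidingWindowLayer for sliding-window layers is decided by the library from the model configuration, not by user code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "mistralai/Mistral-7B-v0.1"  # illustrative; any model using sliding window attention

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer(
    "Long prompts are where the sliding-window cache saves the most memory.",
    return_tensors="pt",
).to(model.device)

# DynamicCache stores keys/values per layer; after the refactor, sliding-window
# layers can be backed by DynamicSlidingWindowLayer, which caps the cached length
# at the window size instead of growing with the full sequence.
past_key_values = DynamicCache()
output = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```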

🐛 Bug Fixes

  • Fix MXFP4 quantizer validation to allow CPU inference with the dequantize option (see the sketch below).
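
A minimal sketch of the now-permitted path, assuming the Mxfp4Config(dequantize=True) option used for MXFP4 checkpoints such as gpt-oss; the checkpoint identifier is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config

model_id = "openai/gpt-oss-20b"  # illustrative MXFP4-quantized checkpoint

# dequantize=True unpacks the MXFP4 weights to higher precision at load time,
# which is what enables CPU inference (and older GPUs without native MXFP4 kernels).
quantization_config = Mxfp4Config(dequantize=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="cpu",  # previously rejected by the quantizer validation
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```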

🔧 Affected Symbols

DinoV3, XCodec, Ovis2, MetaCLIP2, Florence2, SAM2, Kosmos2_5, HunYuan, SeedOSS, GLM4_5V, DynamicSlidingWindowLayer, Cache