v4.56.0
📦 transformers
✨ 12 features · 🐛 1 fix · 🔧 12 symbols
Summary
This release introduces several major vision and multimodal models, including DINOv3, SAM 2, and Ovis2, alongside a significant refactor of the caching system that adds memory-efficient support for sliding-window attention.
Migration Steps
- Review the release notes for the new model additions: DINOv3, X-Codec, Ovis2, MetaCLIP 2, Florence-2, SAM 2, and Kosmos-2.5.
- To use any of the newly added models, consult the Hugging Face Model Hub or the corresponding paper for the correct model identifier and usage instructions (see the loading sketch after these steps).
- No breaking changes were identified in this release summary; existing code using previously supported models should continue to function as expected.
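As a starting point, here is a minimal loading sketch using the Auto classes. The checkpoint identifier below is a placeholder, not a real repository name; look up the actual DINOv3 (or other new model) checkpoint on the Hugging Face Model Hub before running it.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Placeholder identifier; replace with the real DINOv3 repository name from the Hub.
checkpoint = "facebook/dinov3-base"

processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Any local image works here; "example.jpg" is just a stand-in.
image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # patch-level features
```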
✨ New Features
- Add DINOv3 vision foundation model support.
- Add X-Codec neural audio codec for music continuation and audio tokenization.
- Add Ovis2 multi-modal large language model and processor.
- Add MetaCLIP 2 multilingual CLIP replication.
- Add Florence-2 vision foundation model for prompt-based vision tasks.
- Add Segment Anything 2 (SAM2) for image and video segmentation.
- Add Kosmos-2.5 multimodal literate model for text-intensive images.
- Add HunYuan model support.
- Add ByteDance Seed-OSS model support.
- Add GLM-4.5V model support.
- Refactor the caching system to add DynamicSlidingWindowLayer for memory-efficient sliding-window attention (see the sketch after this list).
- Stabilize MXFP4 quantization support, including CPU inference via dequantization and support for older hardware (sm75+).
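The sliding-window cache refactor is easiest to picture as a per-layer key/value buffer that never grows past the attention window. The sketch below illustrates that idea only; the class name, method, and tensor shapes are assumptions for demonstration and do not reproduce the library's DynamicSlidingWindowLayer implementation.

```python
import torch


class SlidingWindowKVCacheSketch:
    """Toy per-layer KV cache that keeps only the last `window` positions.

    Illustrative sketch of the idea behind memory-efficient sliding-window
    caching; not the transformers implementation.
    """

    def __init__(self, window: int):
        self.window = window
        self.keys = None    # (batch, heads, seq, head_dim)
        self.values = None

    def update(self, new_keys: torch.Tensor, new_values: torch.Tensor):
        if self.keys is None:
            self.keys, self.values = new_keys, new_values
        else:
            self.keys = torch.cat([self.keys, new_keys], dim=2)
            self.values = torch.cat([self.values, new_values], dim=2)
        # Drop positions outside the attention window so memory stays bounded.
        self.keys = self.keys[:, :, -self.window:, :]
        self.values = self.values[:, :, -self.window:, :]
        return self.keys, self.values


# Cache length never exceeds the window, even as tokens stream in.
cache = SlidingWindowKVCacheSketch(window=4)
for step in range(10):
    k = torch.randn(1, 2, 1, 8)  # one new token's keys per step
    v = torch.randn(1, 2, 1, 8)
    keys, values = cache.update(k, v)

print(keys.shape)  # torch.Size([1, 2, 4, 8])
```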
🐛 Bug Fixes
- Fix MXFP4 quantizer validation to allow CPU inference with dequantize option.
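If you rely on this fix, loading with dequantization on CPU looks roughly like the sketch below. It assumes the Mxfp4Config class and its dequantize flag behave as described in the release, and the checkpoint name is a placeholder, not a real repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config

# Placeholder identifier; substitute an actual MXFP4-quantized repository.
checkpoint = "org/mxfp4-quantized-model"

# dequantize=True unpacks MXFP4 weights into a higher-precision dtype,
# which is what enables inference on CPU (and on pre-Hopper GPUs).
quantization_config = Mxfp4Config(dequantize=True)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
    device_map="cpu",
)

inputs = tokenizer("Hello", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```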
🔧 Affected Symbols
DinoV3, XCodec, Ovis2, MetaCLIP2, Florence2, SAM2, Kosmos2_5, HunYuan, SeedOSS, GLM4_5V, DynamicSlidingWindowLayer, Cache