v4.49.0
Breaking Changes · 📦 transformers · View on GitHub →
⚠ 1 breaking · ✨ 12 features · 🐛 2 fixes · 🔧 17 symbols
Summary
This release introduces several new models, including Helium, Qwen2.5-VL, and Zamba2, alongside a new CLI chat command and standardized fast image processors.
⚠️ Breaking Changes
- The DPT image processor's preprocess method now accepts a segmentation_maps argument, which changes the positional argument order. Code that passes arguments positionally may break; switch to keyword arguments (see the sketch below).
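A minimal sketch of the keyword-argument call, assuming a DPT checkpoint such as Intel/dpt-large and stand-in NumPy arrays for the image and segmentation map (both are illustrative, not taken from the release notes):

```python
from transformers import DPTImageProcessor
import numpy as np

processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")

image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)          # stand-in RGB image
segmentation_map = np.random.randint(0, 150, (480, 640), dtype=np.uint8)  # stand-in label mask

# Passing everything by keyword is unaffected by the newly added
# `segmentation_maps` parameter; positional calls may now bind arguments
# to the wrong parameter.
inputs = processor.preprocess(
    images=image,
    segmentation_maps=segmentation_map,
    return_tensors="pt",
)
```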
Migration Steps
- Review the release notes for newly added models: Helium, Qwen2.5-VL, SuperGlue, Granite Vision, Zamba2, GOT-OCR 2.0, DAB-DETR, Depth PRO, and RT-DETRv2.
- If you were using older versions or related models that have been superseded (e.g., Qwen2-VL in favor of Qwen2.5-VL), update your model loading configuration to the new model identifiers (a loading sketch follows this list).
- If integrating Helium, note its supported languages (English, French, German, Italian, Portuguese, Spanish) and ensure your use case aligns.
- If migrating to Qwen2.5-VL, be aware of architectural improvements like SwiGLU, RMSNorm in the ViT, and enhanced temporal dynamic resolution handling via upgraded MRoPE.
- If using SuperGlue, integrate it alongside SuperPoint for feature matching and pose estimation tasks.
- If migrating to Granite Vision, understand it uses a Granite LM and SigLIP visual encoder, leveraging multiple concatenated vision hidden states.
- If adopting Zamba2, note its hybrid Mamba/Transformer architecture and that it uses the Mistral v0.1 tokenizer.
- If using GOT-OCR 2.0, be aware that this implementation outputs plain text, requiring external processing for complex formats like tables or math formulas.
- If migrating to DAB-DETR, note its use of dynamically updated anchor boxes for improved cross-attention computation in object detection.
- If adopting Depth PRO, recognize it is a foundation model for zero-shot metric monocular depth estimation using a Dinov2 encoder and DPT-like fusion.
- If migrating to RT-DETRv2, expect slight mAP improvements over RT-DETR due to selective multi-scale feature extraction and a discrete sampling operator.
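For the Qwen2-VL to Qwen2.5-VL step, a hedged loading sketch: the checkpoint id Qwen/Qwen2.5-VL-7B-Instruct and the loading options shown are assumptions based on the model family named in this release, so substitute the identifier and settings you actually deploy.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Assumed checkpoint id; previously this might have pointed at e.g. "Qwen/Qwen2-VL-7B-Instruct".
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",   # choose a dtype/device placement appropriate for your hardware
    device_map="auto",
)
```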
✨ New Features
- Added Helium-1 preview: a 2B parameter lightweight model for edge/mobile devices.
- Added Qwen2.5-VL: vision-language model with window attention and dynamic resolution.
- Added SuperGlue: graph neural network for feature matching between images.
- Added Granite Vision: LLaVA-NeXT variant using Granite LLM and SigLIP encoder.
- Added Zamba2: hybrid Mamba-Transformer models (1.2B, 2.7B, 7B).
- Added GOT-OCR 2.0: general optical character recognition for documents, charts, and formulas.
- Added DAB-DETR: object detection model using dynamic anchor boxes.
- Added Depth PRO: zero-shot metric monocular depth estimation foundation model.
- Added RT-DETRv2: real-time detection transformer with selective multi-scale feature extraction.
- Added 'chat' command to Transformers-CLI for terminal-based model interaction.
- Standardized image processors for OwlViT, Owlv2, OmDet Turbo, and Grounding DINO.
- Introduced a fast variant of the Qwen2-VL image processor (see the sketch below).
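To opt into the fast Qwen2-VL image processor, a short sketch using AutoImageProcessor with use_fast=True; the checkpoint id here is illustrative.

```python
from transformers import AutoImageProcessor

# `use_fast=True` requests the fast (torchvision-backed) variant when one is available.
fast_processor = AutoImageProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    use_fast=True,
)
print(type(fast_processor).__name__)  # expected: Qwen2VLImageProcessorFast
```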
🐛 Bug Fixes
- Fixed DPT image processors to correctly support segmentation_maps.
- Removed multi-threaded image conversion for fast image processors to improve stability.
🔧 Affected Symbols
Helium, Qwen2_5_VL, SuperGlue, GraniteVision, Zamba2, GotOcr2, DabDetr, DepthPro, RtDetrv2, Transformers-CLI, DPTImageProcessor, OwlViTProcessor, Owlv2Processor, OmDetTurboProcessor, GroundingDinoProcessor, ImageProcessorFast, Qwen2VLImageProcessorFast