
v4.49.0


Summary

This release introduces several new models including Helium, Qwen2.5-VL, and Zamba2, alongside a new CLI chat feature and standardized fast image processors.

⚠️ Breaking Changes

  • The DPT image processor's preprocess method now accepts a segmentation_maps argument. Because this changes the positional parameter order, code that passes arguments positionally may break; switch to keyword arguments to resolve.
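Why inserting a parameter breaks positional callers can be sketched in plain Python. This is a minimal illustration of the failure mode, not the real transformers signature (the stub names and parameters below are hypothetical):

```python
# Hypothetical stubs illustrating the pitfall: a new parameter inserted
# early in a signature silently rebinds positional arguments.

def preprocess_old(images, do_resize=True, size=384):
    # Signature before the change (illustrative only).
    return {"images": images, "do_resize": do_resize, "size": size}

def preprocess_new(images, segmentation_maps=None, do_resize=True, size=384):
    # After the change, `segmentation_maps` occupies the slot
    # where `do_resize` used to be.
    return {"images": images, "segmentation_maps": segmentation_maps,
            "do_resize": do_resize, "size": size}

# A positional call written against the old signature now misbinds:
result = preprocess_new(["img.png"], False)
assert result["segmentation_maps"] is False  # False landed in the wrong slot
assert result["do_resize"] is True           # not what the caller intended

# Keyword arguments are immune to the reordering:
result = preprocess_new(["img.png"], do_resize=False)
assert result["do_resize"] is False
```

The same reasoning applies to any method that gains a parameter mid-signature, which is why the migration advice is simply "use keywords".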

Migration Steps

  1. Review the release notes for newly added models: Helium, Qwen2.5-VL, SuperGlue, Granite Vision, Zamba2, GOT-OCR 2.0, DAB-DETR, Depth PRO, and RT-DETRv2.
  2. If you were using older versions or related models that have been superseded (e.g., Qwen2-VL in favor of Qwen2.5-VL), update your model loading configurations to use the new model identifiers.
  3. If integrating Helium, note its supported languages (English, French, German, Italian, Portuguese, Spanish) and ensure your use case aligns.
  4. If migrating to Qwen2.5-VL, be aware of architectural improvements like SwiGLU, RMSNorm in the ViT, and enhanced temporal dynamic resolution handling via upgraded MRoPE.
  5. If using SuperGlue, integrate it alongside SuperPoint for feature matching and pose estimation tasks.
  6. If migrating to Granite Vision, understand it uses a Granite LM and SigLIP visual encoder, leveraging multiple concatenated vision hidden states.
  7. If adopting Zamba2, note its hybrid Mamba/Transformer architecture and that it uses the Mistral v0.1 tokenizer.
  8. If using GOT-OCR 2.0, be aware that this implementation outputs plain text, requiring external processing for complex formats like tables or math formulas.
  9. If migrating to DAB-DETR, note its use of dynamically updated anchor boxes for improved cross-attention computation in object detection.
  10. If adopting Depth PRO, recognize it is a foundation model for zero-shot metric monocular depth estimation using a Dinov2 encoder and DPT-like fusion.
  11. If migrating to RT-DETRv2, expect slight mAP improvements over RT-DETR due to selective multi-scale feature extraction and a discrete sampling operator.
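Step 2 above can be centralized in a small config-upgrade helper. A minimal sketch, assuming your code references checkpoints by Hub identifier; the mapping below is illustrative (verify the exact successor identifiers on the Hugging Face Hub before migrating):

```python
# Illustrative mapping of superseded checkpoints to their successors.
# These identifiers are examples, not an authoritative or exhaustive list.
SUPERSEDED = {
    "Qwen/Qwen2-VL-7B-Instruct": "Qwen/Qwen2.5-VL-7B-Instruct",
}

def upgrade_model_id(model_id: str) -> str:
    """Return the successor checkpoint for a superseded model id, else unchanged."""
    return SUPERSEDED.get(model_id, model_id)

assert upgrade_model_id("Qwen/Qwen2-VL-7B-Instruct") == "Qwen/Qwen2.5-VL-7B-Instruct"
assert upgrade_model_id("unrelated/model") == "unrelated/model"
```

Keeping the mapping in one place makes the migration auditable and lets unaffected model identifiers pass through untouched.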

✨ New Features

  • Added Helium-1 preview: a 2B parameter lightweight model for edge/mobile devices.
  • Added Qwen2.5-VL: vision-language model with window attention and dynamic resolution.
  • Added SuperGlue: graph neural network for feature matching between images.
  • Added Granite Vision: LLaVA-NeXT variant using Granite LLM and SigLIP encoder.
  • Added Zamba2: hybrid Mamba-Transformer models (1.2B, 2.7B, 7B).
  • Added GOT-OCR 2.0: general optical character recognition for documents, charts, and formulas.
  • Added DAB-DETR: object detection model using dynamic anchor boxes.
  • Added Depth PRO: zero-shot metric monocular depth estimation foundation model.
  • Added RT-DETRv2: real-time detection transformer with selective multi-scale feature extraction.
  • Added 'chat' command to Transformers-CLI for terminal-based model interaction.
  • Standardized image processors for OwlViT, Owlv2, OmDet Turbo, and Grounding DINO.
  • Introduced fast variant for Qwen2-VL image processor.
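Fast image processors in transformers are conventionally opted into via a use_fast flag on the auto class. The stub below sketches that dispatch pattern only; the class bodies and backend strings are hypothetical stand-ins, not the real transformers implementations:

```python
# Hypothetical stand-ins for a slow and fast processor pair; real transformers
# fast variants follow the *ImageProcessorFast naming convention.
class Qwen2VLImageProcessor:
    backend = "numpy"          # slow variant (illustrative)

class Qwen2VLImageProcessorFast:
    backend = "torchvision"    # fast variant introduced in this release (illustrative)

def auto_image_processor(use_fast: bool = False):
    """Sketch of the use_fast dispatch pattern (not the real AutoImageProcessor)."""
    return Qwen2VLImageProcessorFast() if use_fast else Qwen2VLImageProcessor()

assert auto_image_processor(use_fast=True).backend == "torchvision"
assert auto_image_processor().backend == "numpy"
```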

🐛 Bug Fixes

  • Fixed DPT image processors to correctly support segmentation_maps.
  • Removed multi-threaded image conversion for fast image processors to improve stability.

🔧 Affected Symbols

Helium, Qwen2_5_VL, SuperGlue, GraniteVision, Zamba2, GotOcr2, DabDetr, DepthPro, RtDetrv2, Transformers-CLI, DPTImageProcessor, OwlViTProcessor, Owlv2Processor, OmDetTurboProcessor, GroundingDinoProcessor, ImageProcessorFast, Qwen2VLImageProcessorFast