Change8

v4.49.0-Gemma-3

📦 transformersView on GitHub →
6 features🔧 4 symbols

Summary

This release introduces Google's Gemma 3 multimodal models to the transformers library, featuring a SigLIP vision encoder and Gemma 2 language decoder with support for high-resolution image cropping and multi-image inference.

Migration Steps

  1. Install the specific release tag using: pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

✨ New Features

  • Added support for Gemma 3, a multimodal vision-language model.
  • Introduced Gemma3ForConditionalGeneration for image+text and image-only inputs.
  • Introduced Gemma3ForCausalLM for optimized text-only generation.
  • Support for multi-image inputs within a single sample.
  • Added 'pan and scan' image cropping (do_pan_and_scan=True) to improve resolution for tasks like DocVQA and OCR.
  • Processor support for apply_chat_template with multimodal input handling.

🔧 Affected Symbols

Gemma3ForConditionalGenerationGemma3ForCausalLMGemma3ProcessorAutoProcessor