v4.49.0-Gemma-3
📦 transformersView on GitHub →
✨ 6 features🔧 4 symbols
Summary
This release introduces Google's Gemma 3 multimodal models to the transformers library, featuring a SigLIP vision encoder and Gemma 2 language decoder with support for high-resolution image cropping and multi-image inference.
Migration Steps
- Install the specific release tag using: pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
✨ New Features
- Added support for Gemma 3, a multimodal vision-language model.
- Introduced Gemma3ForConditionalGeneration for image+text and image-only inputs.
- Introduced Gemma3ForCausalLM for optimized text-only generation.
- Support for multi-image inputs within a single sample.
- Added 'pan and scan' image cropping (do_pan_and_scan=True) to improve resolution for tasks like DocVQA and OCR.
- Processor support for apply_chat_template with multimodal input handling.
🔧 Affected Symbols
Gemma3ForConditionalGenerationGemma3ForCausalLMGemma3ProcessorAutoProcessor