Changelog

v5.1.0

Breaking Changes
📦 transformers
⚠️ 6 breaking · ✨ 8 features · 🐛 21 fixes · ⚡ 1 deprecation · 🔧 16 symbols

Summary

This release introduces four major new models: EXAONE-MoE, PP-DocLayoutV3, Youtu-LLM, and GlmOcr. It also includes several breaking changes related to model structure adjustments, cache initialization for sliding window attention, and configuration cleanup.

⚠️ Breaking Changes

  • The T5Gemma2 model structure was modified so that the attention implementation is set for all sub-configs; previously, the encoder text config's attention implementation was not set correctly. Users relying on the previous structure should verify their attention settings.
  • Generation cache initialization was refactored to properly respect sliding window configurations. Models using sliding window attention (like Afmoe) will now enforce window size limits during generation, potentially changing sequence length handling.
  • Redundant configuration attributes for backbone loading were removed, consolidating logic into a single source of truth: `config.backbone_config`. Models must now rely on this attribute for backbone loading.
  • The DETR model structure was refactored to align with other vision models in the library.
  • Floating-point precision in JanusImageProcessor resize was fixed by replacing `int()` with `round()`, which may result in slight numerical differences in output.
  • The deprecated class `AnnotionFormat` was removed in favor of `AnnotationFormat`.
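The `int()` → `round()` change in the resize fix above can be illustrated with a minimal, self-contained sketch (the helper names are hypothetical, not the processor's actual API):

```python
def old_target_size(dim: int, scale: float) -> int:
    # Pre-fix behavior: int() truncates the fractional part.
    return int(dim * scale)

def new_target_size(dim: int, scale: float) -> int:
    # Post-fix behavior: round() rounds to the nearest integer.
    return round(dim * scale)

# When the scaled dimension has a fractional part of .5 or more,
# the two differ by one pixel:
print(old_target_size(7, 0.5))  # 3  (int(3.5) truncates)
print(new_target_size(7, 0.5))  # 4  (round(3.5) rounds up)
```

This one-pixel difference in the target size is what can propagate into the "slight numerical differences" mentioned above.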

Migration Steps

  1. If using T5Gemma2, verify that attention implementation settings are correctly propagated across sub-configs.
  2. Review generation code for models using sliding window attention, as window size limits are now strictly enforced.
  3. Update model configuration files to use only `config.backbone_config` as the single source of truth for backbone loading.
  4. Replace usage of the removed `AnnotionFormat` with `AnnotationFormat`.
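Step 4 is a straightforward rename. A minimal sketch follows, using a stand-in enum rather than importing transformers so the snippet is self-contained; the member names mirror the library's but are reproduced here only for illustration:

```python
from enum import Enum

# Stand-in for the library's AnnotationFormat enum (illustrative only).
class AnnotationFormat(Enum):
    COCO_DETECTION = "coco_detection"
    COCO_PANOPTIC = "coco_panoptic"

# Before (the misspelled alias, removed in this release):
#   from transformers.image_utils import AnnotionFormat
# After:
#   from transformers.image_utils import AnnotationFormat

fmt = AnnotationFormat.COCO_DETECTION
print(fmt.value)  # coco_detection
```

Since `AnnotionFormat` was only a deprecated alias, the migration is a pure rename with no behavioral change.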

✨ New Features

  • Added support for the K-EXAONE (EXAONE-MoE) multilingual language model.
  • Added support for the PP-DocLayoutV3 unified layout analysis model.
  • Added support for the Youtu-LLM small, powerful LLM with long context and agentic capabilities.
  • Added support for the GlmOcr multimodal OCR model.
  • Added streaming capability to the Moonshine model.
  • Allowed bi-directional attention for all models.
  • Added support for loading T5Gemma2Encoder with AutoModel.
  • Added EoMT with DINOv3 backbone.
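To make the bi-directional attention feature concrete, here is a minimal, library-free sketch of the difference between a causal and a bi-directional attention mask (illustrative only, not the library's implementation):

```python
def attention_mask(seq_len: int, bidirectional: bool) -> list[list[int]]:
    # 1 = may attend, 0 = masked. A causal mask is lower-triangular
    # (each position sees only itself and earlier positions);
    # a bi-directional mask lets every position see every other.
    return [
        [1 if (bidirectional or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

print(attention_mask(3, bidirectional=False))
# [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(attention_mask(3, bidirectional=True))
# [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```

Bi-directional attention is typically what encoder-style usage (e.g. embedding extraction) wants, whereas generation keeps the causal mask.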

🐛 Bug Fixes

  • Migrated legacy `segmentation_indices` to `out_indices` in BeitConfig.
  • Removed SDPA workarounds for torch 2.4+.
  • Added `use_deterministic` to guarantee consistency for the youtu-llm model.
  • Added `compatible_model_types` to suppress model type mismatch warnings.
  • Fixed T5 v1.1 detection.
  • Fixed scheduler initialization order.
  • Fixed accelerate integration import.
  • Fixed dtype handling in the image-text-to-text pipeline.
  • Prevented initialization of siglip's lecun_normal_, default_flax_embed_init in ZeRO3.
  • Fixed AttributeError for Qwen3_omni_moe.
  • Fixed norm_eps dtype.
  • Fixed Llava onevision output alignment for tests and added `image_sizes` input param.
  • Fixed CLIPOutput attentions not being returned.
  • Fixed crash of custom models in Notebook or Repl.
  • Fixed a GPT-OSS tensor-parallel (TP) crash.
  • Kept the order of incoming requests in continuous batching (CB).
  • Fixed Apertus model loading (NotImplementedError: Cannot copy out of meta tensor; no data!).
  • Removed `num_frames` in ASR pipeline.
  • Removed `ipex` and `ccl` dependencies for XPU and CPU.
  • Fixed T5 failures.
  • Fixed FP8Expert for Qwen.

⚡ Deprecations

  • The cache class is deprecated.