
v4.53.0

📦 transformers
✨ 8 features · 🔧 9 symbols

Summary

Release v4.53.0 introduces several major model architectures, including Gemma 3n, Dia TTS, Kyutai STT, and the 456B-parameter MiniMax model. The update focuses heavily on multimodal capabilities, efficient parameter usage, and long-context support.

Migration Steps

  1. Review the release notes for the new model additions: Gemma3n, Dia, Kyutai Speech-to-Text, V-JEPA 2, Arcee, ColQwen2, MiniMax, and T5Gemma.
  2. If you rely on older multimodal pipelines or on specific model architectures, check the documentation for the newly supported models (e.g., Gemma3n for multimodal input, Dia for TTS, Kyutai for STT, Arcee for Llama-like models with ReLU² activation).
  3. If migrating to Gemma3n, ensure your input handling supports multimodal data (text, image, video, and audio), using soft tokens such as '<image_soft_token>' where image input is expected, as shown in the sketch below.
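
A minimal sketch of step 3 using the image-text-to-text pipeline task. The checkpoint name, image URL, and generation settings are illustrative assumptions rather than values taken from the release notes:

    import torch
    from transformers import pipeline

    # Checkpoint name is illustrative; substitute the Gemma 3n variant you deploy.
    pipe = pipeline(
        "image-text-to-text",
        model="google/gemma-3n-e2b-it",  # hypothetical checkpoint id
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # '<image_soft_token>' marks where the image is injected into the prompt.
    output = pipe(
        "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
        text="<image_soft_token> In this image, there is",
        max_new_tokens=32,
    )
    print(output)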

✨ New Features

  • Added Gemma 3n: Multimodal models (text, image, video, audio) with selective parameter activation for low-resource devices.
  • Added Dia: A 1.6B-parameter text-to-speech (TTS) model with emotion and tone control via audio conditioning (see the first sketch after this list).
  • Added Kyutai Speech-to-Text: Architecture based on the Mimi codec and a Moshi-like decoder, available in 1B and 2.6B variants (see the second sketch after this list).
  • Added V-JEPA 2: Self-supervised video encoders for motion understanding and robot manipulation tasks.
  • Added Arcee: Decoder-only transformer using ReLU² activation for improved training efficiency.
  • Added ColQwen2: Visual document retrieval model using Qwen2-VL backbone and late interaction similarity.
  • Added MiniMax: 456B parameter hybrid architecture (Lightning Attention + MoE) supporting up to 4M token context.
  • Added Encoder-Decoder Gemma (T5Gemma): Adaptation of pretrained decoder-only models into encoder-decoder structures.
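
Two hedged sketches of the new speech models follow. First, Dia text-to-speech via DiaForConditionalGeneration; the checkpoint name below is an assumption, and the [S1] speaker tag follows Dia's dialogue format:

    from transformers import AutoProcessor, DiaForConditionalGeneration

    # Checkpoint name is illustrative; point this at the published Dia weights.
    checkpoint = "nari-labs/Dia-1.6B"
    processor = AutoProcessor.from_pretrained(checkpoint)
    model = DiaForConditionalGeneration.from_pretrained(checkpoint).to("cuda")

    # Speaker tags ([S1], [S2]) structure the dialogue; an audio prompt can
    # additionally condition emotion and tone.
    text = ["[S1] Dia generates dialogue with controllable emotion and tone."]
    inputs = processor(text=text, padding=True, return_tensors="pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens=256)
    decoded = processor.batch_decode(outputs)
    processor.save_audio(decoded, "dia_sample.wav")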
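
Second, Kyutai Speech-to-Text. The checkpoint name and the dummy dataset are assumptions; any mono waveform array can stand in for the sample clip:

    from datasets import load_dataset
    from transformers import (
        KyutaiSpeechToTextForConditionalGeneration,
        KyutaiSpeechToTextProcessor,
    )

    # Checkpoint name is illustrative; pick the 1B or 2.6B variant you need.
    model_id = "kyutai/stt-2.6b-en"
    processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
    model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(
        model_id, device_map="auto"
    )

    # A dummy LibriSpeech clip stands in for real audio here.
    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    inputs = processor(ds[0]["audio"]["array"])
    inputs = inputs.to(model.device)

    tokens = model.generate(**inputs)
    print(processor.batch_decode(tokens, skip_special_tokens=True))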

🔧 Affected Symbols

Gemma3n · Dia · KyutaiSTT · V-JEPA 2 · Arcee · ColQwen2 · MiniMax · T5Gemma · pipeline