Changelog

v5.3.0

Breaking Changes
📦 transformers
⚠️ 2 breaking · ✨ 9 features · 🔧 1 symbol

Summary

This release introduces several new models across various modalities, including EuroBERT, VibeVoice ASR, TimesFM 2.5, and multiple vision/audio models. It also includes critical fixes and stabilization for Tensor Parallelism support in decoder-only models.

⚠️ Breaking Changes

  • Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.
  • The `Ernie4.5 VL MoE` model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code.

Migration Steps

  1. Update TP configurations and conversion mappings due to stabilized Tensor parallelism support for dense and MoE decoder-only models.
  2. Update any references to the old `Ernie4.5 VL MoE` model class and configuration names to align with vLLM/SGLang conventions.
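For step 2, a small compatibility shim can keep code working across the rename. These notes do not list the exact old and new `Ernie4.5 VL MoE` identifiers, so the helper below is written generically: it resolves the first attribute name that exists on a module, which works for any renamed class (new name first, old name as the fallback).

```python
def resolve_class(module, candidate_names):
    """Return the first attribute in `candidate_names` that exists on
    `module`. Useful when a class is renamed between library versions:
    list the new name first, then the old one as a fallback."""
    for name in candidate_names:
        cls = getattr(module, name, None)
        if cls is not None:
            return cls
    raise AttributeError(
        f"none of {candidate_names} found on {module.__name__}"
    )

# Demonstration with the standard library standing in for transformers:
import math
sqrt = resolve_class(math, ["sqrt_v2", "sqrt"])  # "sqrt_v2" is absent, so this falls back to math.sqrt
```

With `transformers` installed, the same pattern would be `resolve_class(transformers, ["<new Ernie class name>", "<old Ernie class name>"])`, where both names are placeholders for the actual identifiers introduced by this release.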

✨ New Features

  • Added EuroBERT, a multilingual encoder model based on a refreshed transformer architecture with bidirectional attention, supporting sequences up to 8192 tokens.
  • Added VibeVoice ASR, an automatic speech recognition model supporting joint ASR/diarization/timestamping, customized hotwords, and processing up to 60 minutes of audio.
  • Added TimesFM 2.5, a decoder-only time-series foundation model featuring rotary attention, QK normalization, and continuous quantile prediction for zero-shot forecasting.
  • Added PP-DocLayoutV2, a lightweight model for document layout analysis focusing on element detection, classification, and reading order prediction.
  • Added OlmoHybrid, a hybrid architecture model combining standard transformer attention layers with linear attention layers using the Gated Deltanet for improved efficiency.
  • Added ModernVBERT, a Vision-Language encoder combining ModernBert with a SigLIP vision encoder, optimized for visual document understanding.
  • Added ColModernVBert, a model for efficient visual document retrieval leveraging ModernVBert to construct multi-vector embeddings from document images.
  • Added Higgs Audio V2, an audio foundation model pretrained on extensive audio and text data, supporting expressive audio generation tasks like voice cloning.
  • Added Higgs Audio V2 Tokenizer, an audio tokenization model operating at 25 fps with unified 24 kHz training for speech, music, and sound-event clips.

Affected Symbols