transformers v5.3.0
Summary
This release adds several new models across modalities, including EuroBERT, VibeVoice ASR, TimesFM 2.5, and multiple vision and audio models. It also fixes and stabilizes tensor parallelism (TP) support for dense and MoE decoder-only models.
⚠️ Breaking Changes
- Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.
- The `Ernie4.5 VL MoE` model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code.
Migration Steps
- Update your TP configurations and conversion mappings to match the fixed and stabilized tensor parallelism support for dense and MoE decoder-only models.
- Replace any references to the old `Ernie4.5 VL MoE` model class and configuration names with the renamed ones, which follow vLLM/SGLang conventions.
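One way to locate code affected by the rename is to scan a project for the old identifiers. The sketch below is a minimal, standard-library-only helper. The names in `RENAMES` are hypothetical placeholders, not the real class names, since the release summary does not list the old or new identifiers; substitute the actual pairs from the upstream release notes.

```python
import re
from pathlib import Path

# Hypothetical placeholders: replace with the actual old -> new identifiers
# from the upstream release notes. These are NOT real transformers class names.
RENAMES = {
    "OldErnieVLMoeModel": "NewErnieVLMoeModel",
    "OldErnieVLMoeConfig": "NewErnieVLMoeConfig",
}


def find_old_references(source: str, renames: dict) -> list:
    """Return (line_number, old_name, new_name) for each whole-word match."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, renames)) + r")\b")
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in pattern.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, renames[old]))
    return hits


def scan_tree(root: str, renames: dict) -> dict:
    """Scan every .py file under root and collect the hits per file."""
    report = {}
    for path in Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        hits = find_old_references(text, renames)
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_tree(".", RENAMES)` from a project root returns a mapping from file path to the lines that still use an old name. The helper only reports matches and does not rewrite files, so each replacement can be reviewed by hand.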
✨ New Features
- Added EuroBERT, a multilingual encoder model based on a refreshed transformer architecture with bidirectional attention, supporting sequences up to 8192 tokens.
- Added VibeVoice ASR, an automatic speech recognition model supporting joint ASR/diarization/timestamping, customized hotwords, and processing up to 60 minutes of audio.
- Added TimesFM 2.5, a decoder-only time-series foundation model featuring rotary attention, QK normalization, and continuous quantile prediction for zero-shot forecasting.
- Added PP-DocLayoutV2, a lightweight model for document layout analysis focusing on element detection, classification, and reading order prediction.
- Added OlmoHybrid, a hybrid architecture that combines standard transformer attention layers with Gated DeltaNet linear attention layers for improved efficiency.
- Added ModernVBERT, a vision-language encoder combining ModernBERT with a SigLIP vision encoder, optimized for visual document understanding.
- Added ColModernVBert, a model for efficient visual document retrieval that uses ModernVBERT to construct multi-vector embeddings from document images.
- Added Higgs Audio V2, an audio foundation model pretrained on extensive audio and text data, supporting expressive audio generation tasks like voice cloning.
- Added Higgs Audio V2 Tokenizer, an audio tokenization model operating at 25 fps with unified 24 kHz training for speech, music, and sound-event clips.
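To make the Higgs Audio V2 Tokenizer figures concrete, the following sketch works out the arithmetic implied by a 25 fps token rate at a 24 kHz sample rate. It only derives numbers from the two values stated above. The function names are illustrative, and the release summary does not state details such as the number of codebooks per frame, so none are assumed here.

```python
import math

SAMPLE_RATE_HZ = 24_000  # from the release note: unified 24 kHz training
FRAME_RATE_FPS = 25      # from the release note: tokenizer operates at 25 fps


def samples_per_frame(sample_rate_hz: int, frame_rate_fps: int) -> int:
    """Number of audio samples advanced per token frame (the hop size)."""
    if sample_rate_hz % frame_rate_fps != 0:
        raise ValueError("sample rate is not an integer multiple of the frame rate")
    return sample_rate_hz // frame_rate_fps


def frames_for_duration(seconds: float, frame_rate_fps: int) -> int:
    """Number of token frames needed to cover a clip, rounding up."""
    return math.ceil(seconds * frame_rate_fps)


hop = samples_per_frame(SAMPLE_RATE_HZ, FRAME_RATE_FPS)
print(hop)                                      # 960 samples per frame
print(frames_for_duration(10, FRAME_RATE_FPS))  # 250 frames for a 10-second clip
```

Each frame therefore spans 960 samples, or 40 ms of audio.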