v5.2.0
📦 transformers
⚠ 2 breaking · ✨ 4 features · 🐛 18 fixes · 🔧 17 symbols
Summary
This release introduces several major new models including VoxtralRealtime, GLM-5, and Qwen3.5, alongside significant internal refactoring, particularly around attention mechanisms and trainer stability.
⚠️ Breaking Changes
- The attention mask interface has been updated across the library; any custom attention logic that builds or consumes attention masks should be reviewed.
- ModernBERT's default attention implementation no longer uses Flash Attention (FA), which may change performance or numerics if FA was implicitly relied upon.
Migration Steps
- Review and update any custom attention logic against the new attention mask interface.
- If you relied on ModernBERT's default attention implementation, note that it no longer uses Flash Attention; opt in explicitly if you still need it.
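As a reference point for that audit, below is a generic causal additive mask in the common convention (zero where attention is allowed, minus infinity on future positions). This is an illustrative sketch of the usual mask shape, not transformers' new interface; check whether your custom code assumes this convention:

```python
def causal_additive_mask(seq_len: int, neg: float = float("-inf")) -> list[list[float]]:
    # Additive convention: 0.0 where query position i may attend to key position j,
    # -inf where j is in the future (j > i), so softmax zeroes those scores out.
    return [[0.0 if j <= i else neg for j in range(seq_len)] for i in range(seq_len)]

mask = causal_additive_mask(4)
# Row 0 can only attend to position 0; row 3 attends to all four positions.
```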
✨ New Features
- Introduction of VoxtralRealtime, a streaming speech-to-text model from Mistral AI for low-latency, incremental ASR.
- Addition of GLM-5 (GlmMoeDsa), featuring 744B parameters (40B active) and integrating DeepSeek Sparse Attention (DSA) for improved efficiency on long-context tasks.
- Introduction of Qwen3.5 and Qwen3.5 Moe (specifically Qwen3.5-397B-A17B), a native vision-language model with a hybrid architecture combining linear attention (Gated Delta Networks) with sparse MoE, achieving high efficiency and expanding language support to 201 dialects.
- Addition of VibeVoice Acoustic Tokenizer framework for synthesizing high-fidelity, long-form speech using a next-token diffusion approach.
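To illustrate why a sparse MoE model like GLM-5 activates only ~40B of its 744B parameters per token, here is a minimal top-k expert-routing (gating) sketch. `top_k_route` is a hypothetical helper for illustration, not the GlmMoeDsa implementation:

```python
import math

def top_k_route(logits: list[float], k: int) -> dict[int, float]:
    # Keep only the k highest-scoring experts, then softmax over that subset
    # so the kept routing weights sum to 1; all other experts stay idle.
    idx = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    m = max(logits[i] for i in idx)  # subtract max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in idx}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

weights = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
# Experts 1 and 3 are selected; the token is processed by only those two.
```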
🐛 Bug Fixes
- Fixed `convert_rope_params_to_dict` to correctly use `rope_theta` from the configuration.
- Fixed Qwen RMS norms in modular dependencies.
- Fixed BLOOM tokenizer, CLAP audio features, and CLVP text tester usage in tests.
- Prevented AutoTokenizer type mismatch caused by directory name substring matching.
- Fixed DeepSpeed model preparation logic within the Trainer class.
- Fixed incorrect timestamp calculation in Qwen3VL Processor.
- Fixed GPT-OSS crash when using tensor parallelism (TP).
- Removed `batch_split` from EncoderDecoderCache.
- Removed unnecessary code to make MoE compatible with full-graph compile.
- Updated ModelType for Unigram tokenizer.
- Fixed video interpolation in `pe_audio_video`.
- Fixed looking for `pad_token_id` in the wrong place for Llama4.
- Fixed cardinality error for DETR models lacking an explicit background class.
- Fixed bugs in xLSTM preventing small model training.
- Fixed GlmMoeDsaConfig default `mlp_layer_types` during modular conversion.
- Fixed loading procedure in `MistralCommonBackend`.
- Jamba implementation now falls back to the slow path with a warning instead of raising an error.
- Fixed SwanLab callback to forward resume initialization arguments.
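For context on the `convert_rope_params_to_dict` fix, `rope_theta` determines the standard RoPE inverse-frequency table, so reading it from the wrong place silently changes every rotary frequency. A minimal sketch of that derivation (`rope_inv_freq` is a hypothetical name, not a transformers function):

```python
def rope_inv_freq(head_dim: int, rope_theta: float = 10000.0) -> list[float]:
    # One inverse frequency per rotated pair of dimensions:
    # inv_freq[i] = rope_theta ** (-2i / head_dim), for i = 0 .. head_dim/2 - 1.
    return [rope_theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

freqs = rope_inv_freq(8)
# freqs[0] is always 1.0; later entries decay toward rope_theta ** -1.
```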