Change8

v5.2.0

Breaking Changes
📦 transformers
2 breaking · 4 features · 🐛 18 fixes · 🔧 17 symbols

Summary

This release introduces several major new models including VoxtralRealtime, GLM-5, and Qwen3.5, alongside significant internal refactoring, particularly around attention mechanisms and trainer stability.

⚠️ Breaking Changes

  • The attention mask interface has been updated across the library; any custom attention logic should be reviewed against the new interface.
  • ModernBERT's default attention implementation no longer uses Flash Attention (FA), which may affect performance or numerical behavior if FA was implicitly relied upon.

Migration Steps

  1. Review and update any custom attention logic to match the new attention mask interface.
  2. If you rely on ModernBERT's default attention implementation, be aware that it no longer uses Flash Attention; request FA explicitly if you depend on it.
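The release notes do not spell out the new mask interface, but one detail worth checking when auditing custom attention code is the mask convention itself. A minimal, framework-free sketch (the function name and list-of-lists representation are illustrative assumptions, not the library's API) converting a boolean keep-mask into the additive float mask many attention kernels expect:

```python
NEG_INF = float("-inf")

def to_additive_mask(bool_mask):
    """Convert a boolean keep-mask (True = attend, False = masked)
    into an additive float mask (0.0 = attend, -inf = masked).
    Illustrative only -- not the transformers internal helper."""
    return [[0.0 if keep else NEG_INF for keep in row] for row in bool_mask]
```

In real code the masks would be framework tensors; the point is only that True/False keep-masks and 0/-inf additive masks are easy to mix up when an interface changes, so verify which convention your custom logic now receives.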

✨ New Features

  • Introduction of VoxtralRealtime, a streaming speech-to-text model from Mistral AI for low-latency, incremental ASR.
  • Addition of GLM-5 (GlmMoeDsa), featuring 744B parameters (40B active) and integrating DeepSeek Sparse Attention (DSA) for improved efficiency on long-context tasks.
  • Introduction of Qwen3.5 and Qwen3.5 Moe (specifically Qwen3.5-397B-A17B), a native vision-language model with a hybrid architecture combining linear attention (Gated Delta Networks) with a sparse MoE, achieving high efficiency and expanding language support to 201 dialects.
  • Addition of VibeVoice Acoustic Tokenizer framework for synthesizing high-fidelity, long-form speech using a next-token diffusion approach.
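Low-latency incremental ASR of the kind VoxtralRealtime targets is typically driven by a chunked feed loop: audio arrives in small fixed-size windows and each window is decoded as it lands. A hedged, framework-free sketch of that feeding pattern (the helper name and chunk size are hypothetical, not the model's actual API):

```python
def chunk_stream(samples, chunk_size):
    """Yield fixed-size windows of an audio sample buffer so a
    streaming recognizer can decode incrementally instead of
    waiting for the full utterance. Illustrative pattern only."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]
```

The trailing chunk may be shorter than `chunk_size`; a real streaming pipeline would pad or flush it according to the model's requirements.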

🐛 Bug Fixes

  • Fixed `convert_rope_params_to_dict` to correctly use `rope_theta` from the configuration.
  • Fixed Qwen RMS norms in modular dependencies.
  • Fixed BLOOM tokenizer, CLAP audio features, and CLVP text tester usage in tests.
  • Prevented AutoTokenizer type mismatch caused by directory name substring matching.
  • Fixed DeepSpeed model preparation logic within the Trainer class.
  • Fixed incorrect timestamp calculation in Qwen3VL Processor.
  • Fixed gptoss crash when using tensor parallelism (tp).
  • Removed `batch_split` from EncoderDecoderCache.
  • Removed unnecessary code to make MoE compatible with full graph compile.
  • Updated ModelType for Unigram tokenizer.
  • Fixed video interpolation in `pe_audio_video`.
  • Fixed looking for `pad_token_id` in the wrong place for Llama4.
  • Fixed cardinality error for DETR models lacking an explicit background class.
  • Fixed bugs in xLSTM preventing small model training.
  • Fixed GlmMoeDsaConfig default `mlp_layer_types` during modular conversion.
  • Fixed loading procedure in `MistralCommonBackend`.
  • Added fallback to slow path and warning instead of erroring out in Jamba implementation.
  • Fixed SwanLab callback to forward resume initialization arguments.
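The `convert_rope_params_to_dict` fix above matters because `rope_theta` directly determines the rotary-embedding frequencies: reading the wrong value silently changes every position encoding. As an illustrative plain-Python sketch of the standard RoPE inverse-frequency formula (not the library's exact helper):

```python
def rope_inv_freq(dim, rope_theta=10000.0):
    """Standard RoPE inverse frequencies: 1 / theta^(2i/dim)
    for i = 0 .. dim/2 - 1. Illustrative sketch, not the
    transformers implementation."""
    return [1.0 / (rope_theta ** (2 * i / dim)) for i in range(dim // 2)]
```

With `dim=4` and the default theta the frequencies are `[1.0, 0.01]`; a config whose `rope_theta` was ignored would produce a different spectrum, which is exactly the class of bug the fix addresses.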

Affected Symbols