v5.4.0

📅 Mar 27, 2026 · 📦 transformers

✨ 9 features · 🔧 9 symbols

Summary

This release adds nine new model entries spanning video segmentation (VidEoMT), document processing (UVDoc, SLANeXt, and the PP-OCRv5 series), text embeddings (Jina-Embeddings-v3), large language models (Mistral 4), and robotics (PI0).

✨ New Features

Added VidEoMT, a lightweight encoder-only model for online video segmentation built on ViT.
Added UVDoc, a model for document image rectification and correction that supports both single-input and batched inference.
Added Jina-Embeddings-v3, a multilingual, multi-task text embedding model based on XLM-RoBERTa supporting RoPE and task-specific LoRA adapters.
Added Mistral 4, a hybrid MoE model that unifies Instruct, Reasoning, and Devstral capabilities and supports multimodal input and a 256k context length.
Added PI0, a vision-language-action model for robotics manipulation using a flow matching architecture.
Added SLANeXt series of lightweight models for table structure recognition, with separate weights for wired and wireless tables.
Added PP-OCRv5_mobile_rec model for efficient, multi-language text recognition supporting complex scenarios like handwriting and vertical text.
Added PP-OCRv5_server_rec model for efficient, multi-language text recognition supporting complex scenarios like handwriting and vertical text.
Added PP-OCRv5_mobile_det model for efficient, multi-language text detection supporting diverse scenarios including handwriting, vertical, rotated, and curved text.
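
Because the models listed above first ship in v5.4.0, code that depends on them may want to check the installed transformers version before attempting to load one. The following is a minimal sketch using only the Python standard library. The helper names (release_tuple, meets_minimum, installed_transformers_version) are hypothetical and not part of transformers, and the parser deliberately ignores pre-release suffixes such as "rc1" or "dev0".

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# Release that introduces the models listed in these notes.
MIN_VERSION = "5.4.0"


def release_tuple(v: str, width: int = 3) -> Tuple[int, ...]:
    """Return the leading numeric components of a version string.

    Only the first `width` dot-separated pieces are read, and any
    non-numeric suffix on a piece (e.g. "0rc1") is dropped. Missing
    components are padded with zeros, so "5.4" becomes (5, 4, 0).
    """
    parts = []
    for piece in v.split(".")[:width]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    parts += [0] * (width - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_VERSION) -> bool:
    """True if `installed` is at least `minimum`, compared numerically."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is absent."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers_version()
    if found is None:
        print("transformers is not installed")
    elif meets_minimum(found):
        print(f"transformers {found} includes the v{MIN_VERSION} models")
    else:
        print(f"transformers {found} predates v{MIN_VERSION}; upgrade required")
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where "5.10.0" sorts before "5.4.0" lexicographically.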

Affected Symbols

VidEoMT, UVDoc, Jina-Embeddings-V3, Mistral 4, PI0, SLANeXt, PP-OCRv5_mobile_rec, PP-OCRv5_server_rec, PP-OCRv5_mobile_det