v5.4.0
📦 transformers
Summary
This release adds nine new models and model families spanning video segmentation (VidEoMT), document processing (UVDoc, SLANeXt, and the PP-OCRv5 series), text embeddings (Jina-Embeddings-v3), large language models (Mistral 4), and robotics (PI0).
✨ New Features
- Added VidEoMT, a lightweight encoder-only model for online video segmentation built on ViT.
- Added UVDoc model for document image rectification and correction, supporting single input and batched inference.
- Added Jina-Embeddings-v3, a multilingual, multi-task text embedding model based on XLM-RoBERTa supporting RoPE and task-specific LoRA adapters.
- Added Mistral 4, a hybrid MoE model that unifies Instruct, Reasoning, and Devstral capabilities and supports multimodal input and a 256k context length.
- Added PI0, a vision-language-action model for robotics manipulation using a flow matching architecture.
- Added SLANeXt series of lightweight models for table structure recognition, with separate weights for wired and wireless tables.
- Added PP-OCRv5_mobile_rec model for efficient, multi-language text recognition supporting complex scenarios like handwriting and vertical text.
- Added PP-OCRv5_server_rec model for efficient, multi-language text recognition supporting complex scenarios like handwriting and vertical text.
- Added PP-OCRv5_mobile_det model for efficient, multi-language text detection supporting diverse scenarios including handwriting, vertical, rotated, and curved text.
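
Because every model listed above is new in v5.4.0, code that depends on one of them may want to check the installed version first. The following is a minimal sketch, not part of the library: the helper names `parse_release`, `has_new_models`, and `installed_transformers_version` are hypothetical, and the parsing deliberately ignores pre-release suffixes, so a build such as `5.4.0.dev0` is treated as `5.4.0`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# Release described in these notes (assumed minimum for the models listed above).
MIN_VERSION = (5, 4, 0)


def parse_release(v: str) -> Tuple[int, int, int]:
    """Turn '5.4.0' or '5.4.0.dev0' into (5, 4, 0).

    Only the leading digits of the first three dot-separated fields are read;
    suffixes such as 'rc1' or '.dev0' are ignored (a simplification).
    """
    parts = []
    for piece in v.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad '5.4' to (5, 4, 0)
    return (parts[0], parts[1], parts[2])


def has_new_models(installed: str) -> bool:
    """Return True if `installed` is at least the release in these notes."""
    return parse_release(installed) >= MIN_VERSION


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers_version()
    if found is None:
        print("transformers is not installed")
    elif has_new_models(found):
        print(f"transformers {found} should include the models listed above")
    else:
        print(f"transformers {found} predates v5.4.0; upgrade to use these models")
```

Comparing integer tuples rather than raw strings avoids the pitfall where `"5.10.0"` sorts before `"5.4.0"` lexicographically.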