v5.4.0
📦 sentence-transformers
Summary
This release introduces first-class multimodal support for SentenceTransformer and CrossEncoder, enabling unified processing of text, images, audio, and video. The CrossEncoder architecture has also been fully modularized, adding support for generative rerankers.
Migration Steps
- Review the migration guide at https://sbert.net/docs/migration_guide.html for updated import paths, renamed parameters, and soft breaking changes. Existing code should generally continue to work, emitting warnings where behavior has changed.
✨ New Features
- First-class multimodal support added to SentenceTransformer, enabling computation of embeddings across text, images, audio, and video.
- SentenceTransformer now supports automatic modality detection and preprocessing for mixed-modality inputs.
- Added the model.modalities property and model.supports() method to SentenceTransformer for checking which input types a model accepts.
- Introduced the Router module for composing separate encoders for different modalities within a SentenceTransformer.
- CrossEncoder now supports multimodal inputs for reranking.
- The CrossEncoder architecture is fully modularized, inheriting from BaseModel (a torch.nn.Sequential subclass), which allows inspection and customization of its module chain.
- Support for generative rerankers (CausalLM-based) in CrossEncoder via the new LogitScore module.
- Flash Attention 2 now automatically skips padding for text-only inputs, improving speed and memory usage for variable-length batches.
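To make the automatic modality detection concrete, here is a minimal, self-contained sketch of how mixed-modality inputs can be classified and grouped before encoding. This is illustrative only: the `detect_modality` and `group_by_modality` names and the extension-based heuristic are assumptions for the example, not the library's actual implementation.

```python
from pathlib import Path

# Toy extension tables standing in for real format sniffing.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp"}
AUDIO_EXTS = {".wav", ".mp3", ".flac"}
VIDEO_EXTS = {".mp4", ".avi", ".mov"}

def detect_modality(item) -> str:
    """Guess the modality of a single input item (sketch, not the real logic)."""
    if isinstance(item, bytes):
        return "image"  # raw bytes are commonly decoded as image data
    if isinstance(item, str):
        ext = Path(item).suffix.lower()
        if ext in IMAGE_EXTS:
            return "image"
        if ext in AUDIO_EXTS:
            return "audio"
        if ext in VIDEO_EXTS:
            return "video"
        return "text"  # plain strings default to text
    raise TypeError(f"Unsupported input type: {type(item)!r}")

def group_by_modality(batch):
    """Split a mixed batch so each group can go to its own encoder."""
    groups = {}
    for item in batch:
        groups.setdefault(detect_modality(item), []).append(item)
    return groups
```

In practice the model performs this step internally, so callers can pass a mixed list of strings and file paths to a single encode call.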
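The Router idea of composing per-modality encoders can be sketched as a simple dispatch table. The class name matches the release notes, but this constructor, the `encode()` signature, and the toy lambda encoders are assumptions made for illustration, not the library's API.

```python
# Illustrative Router sketch: map each modality name to an encoder callable
# and dispatch input batches to the matching one.
class Router:
    def __init__(self, encoders):
        # encoders: dict mapping modality name -> callable(batch) -> vectors
        self.encoders = dict(encoders)

    @property
    def modalities(self):
        """Sorted list of modalities this router can handle."""
        return sorted(self.encoders)

    def supports(self, modality: str) -> bool:
        return modality in self.encoders

    def encode(self, modality: str, batch):
        if not self.supports(modality):
            raise ValueError(f"No encoder registered for {modality!r}")
        return self.encoders[modality](batch)

# Toy encoders standing in for real text/image models.
router = Router({
    "text": lambda batch: [[float(len(s))] for s in batch],
    "image": lambda batch: [[0.0] for _ in batch],
})
```

Keeping encoders behind a single dispatch point is what lets one SentenceTransformer expose a uniform interface over heterogeneous backbones.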
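Generative rerankers score a query-document pair from the next-token logits of a causal LM, typically by comparing "yes" versus "no" continuations of a relevance prompt. The sketch below shows that scoring step in isolation; the `logit_score` function and its two-logit softmax are an assumed simplification for illustration, not the LogitScore module's actual implementation.

```python
import math

def logit_score(yes_logit: float, no_logit: float) -> float:
    """Probability of 'yes' under a two-way softmax over the pair of logits."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    ey = math.exp(yes_logit - m)
    en = math.exp(no_logit - m)
    return ey / (ey + en)
```

A reranker then sorts candidate documents by this score, highest first; equal logits yield 0.5, and a strongly positive "yes" logit pushes the score toward 1.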