v5.6.0
Breaking Changes · 📦 transformers
⚠ 1 breaking · ✨ 9 features · 🐛 12 fixes · 🔧 12 symbols
Summary
Version 5.6.0 introduces four new models: OpenAI Privacy Filter, Qianfan-OCR, SAM3-LiteText, and SLANet. It also extends `transformers serve` with a `/v1/completions` endpoint, audio and video inputs, and better tool-calling support, and it fixes bugs in vision, parallelization, and tokenization.
⚠️ Breaking Changes
- The internal `rotary_fn` is no longer registered as a hidden kernel function. Code referencing `self.rotary_fn(...)` within an Attention module must be updated to call the function directly instead.
Migration Steps
- Replace every `self.rotary_fn(...)` call inside an Attention module with a direct call to the rotary function itself.
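The migration can be sketched as below. This is a minimal, self-contained illustration rather than the library's code: `apply_rotary_pos_emb` is a toy stand-in defined locally (a plain 2-D rotation), and `ToyAttention` is a hypothetical module. In a real model, call whichever rotary function your modeling file already imports or defines.

```python
def apply_rotary_pos_emb(q, k, cos, sin):
    """Toy stand-in for a module-level rotary helper (assumption, not the real API).

    Rotates the 2-D vectors q and k by the angle whose cosine/sine are given.
    """
    def rotate(x):
        x1, x2 = x
        return (x1 * cos - x2 * sin, x1 * sin + x2 * cos)

    return rotate(q), rotate(k)


class ToyAttention:
    """Hypothetical attention module used only to show the call-site change."""

    def forward(self, q, k, cos, sin):
        # Before (breaks in 5.6.0, since rotary_fn is no longer registered):
        #     q, k = self.rotary_fn(q, k, cos, sin)
        # After: call the function directly.
        q, k = apply_rotary_pos_emb(q, k, cos, sin)
        return q, k


attn = ToyAttention()
q_rot, k_rot = attn.forward((1.0, 0.0), (0.0, 1.0), cos=0.0, sin=1.0)
```

Only the call site changes; the arguments passed to the rotary function stay the same.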
✨ New Features
- Added OpenAI Privacy Filter, a bidirectional token-classification model for PII detection and masking.
- Added Qianfan-OCR, a 4B-parameter end-to-end document intelligence model supporting prompt-driven tasks like structured parsing and table extraction.
- Added SAM3-LiteText, a lightweight variant of SAM3 using a MobileCLIP-based text encoder for efficient vision-language segmentation.
- Added SLANet and SLANet_plus models for accurate table structure recognition.
- The `transformers serve` command now includes a new `/v1/completions` endpoint for legacy text completion.
- Added multimodal support (audio and video inputs) to `transformers serve`.
- Improved tool-calling support in `transformers serve` via `parse_response` and proper forwarding of `tool_calls`/`tool_call_id` fields.
- Added support for loading adapters with Tensor Parallelism.
- Added MoE configuration to the Gemma4 Tensor Parallelism plan.
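As a rough illustration of the new `/v1/completions` endpoint, the sketch below builds a request without sending it. The base URL, the model id, and the payload fields (`model`, `prompt`, `max_tokens`, borrowed from the OpenAI-style legacy completions format) are assumptions; check the `transformers serve` documentation for the fields it actually accepts.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build (but do not send) a POST request for /v1/completions.

    Field names follow the OpenAI-style legacy completions format (assumption).
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, port, and model id, for illustration only.
req = build_completion_request(
    "http://localhost:8000", "my-org/my-model", "Once upon a time"
)
# With a server running, the request could be sent with:
#     urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the payload easy to inspect before pointing it at a live server.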
🐛 Bug Fixes
- Fixed Qwen2.5-VL temporal RoPE scaling being incorrectly applied to still images.
- Fixed missing or mismatched image processor backends for Emu3 and BLIP models.
- Resolved issues related to modular image processor class duplication.
- Prevented accelerate from incorrectly splitting vision encoders in PeVideo/PeAudioVideo models.
- Improved image loading performance by using torchvision's native `decode_image` in the torchvision backend (up to about 17% faster).
- Fixed Expert Parallelism issues: `RouterParallel` shape, `tp_plan` property, and `grouped_mm` sentinels.
- Fixed NaN weights appearing on non-rank-0 FSDP processes.
- Fixed a resize failure in PP-DocLayoutV3 caused by zero-sized masks.
- Fixed a docstring typo in streamer classes ('tokenized' -> 'tokenizer').
- Resolved a Kimi-K2.5 tokenizer regression and an `AttributeError` raised in `_patch_mistral_regex`.
- Patched a streaming generation crash for `Qwen3VLProcessor` caused by incorrect `_tokenizer` attribute access.
- Fixed a global state leak in the tokenizer registry during tests.
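The last item concerns a common class of test bug: a module-level registry that one test mutates and never restores. A generic way to guard against it is sketched below. `TOKENIZER_REGISTRY` and `isolated_registry` are hypothetical names for illustration; this is not the library's actual registry or the fix that shipped.

```python
from contextlib import contextmanager

# Hypothetical module-level registry; the library's real structure may differ.
TOKENIZER_REGISTRY = {"bert": "BertTokenizer"}


@contextmanager
def isolated_registry(registry):
    """Snapshot a dict-like registry and restore it on exit, even on error."""
    snapshot = dict(registry)
    try:
        yield registry
    finally:
        registry.clear()
        registry.update(snapshot)


# A test can register temporary entries without leaking them to later tests.
with isolated_registry(TOKENIZER_REGISTRY) as reg:
    reg["dummy"] = "DummyTokenizer"
    seen_inside = "dummy" in TOKENIZER_REGISTRY
```

Restoring in a `finally` block means the registry is reset even when the test body raises.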