v0.21.2

Breaking Changes

📅 Jun 24, 2025📦 tokenizersView on GitHub →

⚠ 1 breaking✨ 3 features🐛 7 fixes🔧 3 symbols

Summary

This release focuses on performance optimizations, enabling broader Python no GIL support, and fixing several issues related to onig compilation and training logic.

⚠️ Breaking Changes

Fix training with special tokens: Behavior when training tokenizers might have changed regarding how special tokens are handled.

Migration Steps

Review training logic if you rely on specific behavior regarding special tokens during tokenizer training, due to breaking change in PR #1617.

✨ New Features

Performance optimization enabled by replacing lazy_static with stabilized std::sync::LazyLock.
Enabled broader Python no GIL support by updating pyo3 and rust-numpy dependencies.
Added throughput measurement to benchmarks for more consistent performance evaluation.

🐛 Bug Fixes

Fixed no-onig no-wasm builds.
Fixed typos in strings and comments.
Fixed type notation of merges in BPE Python binding.
Fixed data path in test_continuing_prefix_trainer_mismatch.
Fixed features blending into a paragraph.
Fixed Length Pre-Tokenizer issues.
Consolidated optimization for ahash dary compact str.

🔧 Affected Symbols

BPE Python binding (merges type notation)continuing_prefix_trainer_mismatch (test data path)from_pretrained function (now uses ApiBuilder::from_env())