v0.21.2
📦 tokenizers
⚠ 1 breaking · ✨ 3 features · 🐛 7 fixes · 🔧 3 symbols
Summary
This release focuses on performance optimizations, broader Python no-GIL (free-threaded) support, and fixes for onig-related build issues and training logic.
⚠️ Breaking Changes
- Fix training with special tokens: the behavior of tokenizer training may have changed with respect to how special tokens are handled.
Migration Steps
- If you rely on specific special-token behavior during tokenizer training, review your training logic against the breaking change introduced in PR #1617.
✨ New Features
- Performance optimization from replacing the lazy_static crate with the now-stabilized std::sync::LazyLock (see the sketch after this list).
- Broader Python no-GIL (free-threaded) support enabled by updating the pyo3 and rust-numpy dependencies.
- Added throughput measurement to benchmarks for more consistent performance evaluation.
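The lazy_static to std::sync::LazyLock change is a mechanical migration. Below is a minimal sketch of the pattern, not the crate's actual code: the DEFAULTS static and its contents are hypothetical stand-ins for whatever lazily initialized statics the library defines. The macro-based definition becomes a plain static initialized on first access (LazyLock is stable since Rust 1.80).

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

// Before (lazy_static crate):
// lazy_static! {
//     static ref DEFAULTS: HashMap<&'static str, u32> = {
//         let mut m = HashMap::new();
//         m.insert("max_length", 512);
//         m
//     };
// }

// After: a plain static of type LazyLock; the closure runs exactly once,
// thread-safely, on first access, with no external crate needed.
static DEFAULTS: LazyLock<HashMap<&'static str, u32>> = LazyLock::new(|| {
    let mut m = HashMap::new();
    m.insert("max_length", 512); // hypothetical value, for illustration only
    m
});

fn main() {
    // Dereferencing the LazyLock triggers (or reuses) the one-time initialization.
    println!("max_length = {}", DEFAULTS["max_length"]);
}
```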
🐛 Bug Fixes
- Fixed no-onig no-wasm builds.
- Fixed typos in strings and comments.
- Fixed the type annotation of merges in the BPE Python binding.
- Fixed data path in test_continuing_prefix_trainer_mismatch.
- Fixed the features list blending into a single paragraph.
- Fixed Length Pre-Tokenizer issues.
- Consolidated optimizations using the ahash, dary_heap, and compact_str crates.
🔧 Affected Symbols
- BPE Python binding (merges type annotation)
- continuing_prefix_trainer_mismatch (test data path)
- from_pretrained function (now uses ApiBuilder::from_env(); see the sketch below)
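For context on the from_pretrained change, the sketch below shows how Hub downloads can pick up configuration (for example an access token) from the environment via hf-hub's ApiBuilder::from_env(). This is an illustrative outline, not the crate's actual implementation; the model id and file name are placeholders, and builder details can vary between hf-hub versions.

```rust
use hf_hub::api::sync::ApiBuilder;
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // from_env() reads settings such as HF_TOKEN from the environment
    // instead of requiring them to be passed in explicitly.
    let api = ApiBuilder::from_env().build()?;

    // Download tokenizer.json from a model repo (placeholder id) and load it.
    let path = api
        .model("bert-base-uncased".to_string())
        .get("tokenizer.json")?;
    let tokenizer = Tokenizer::from_file(path)?;

    println!("vocab size: {}", tokenizer.get_vocab_size(true));
    Ok(())
}
```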