
v0.21.2

Breaking Changes
📦 tokenizers
1 breaking · 3 features · 🐛 7 fixes · 🔧 3 symbols

Summary

This release delivers performance optimizations, broadens support for free-threaded (no-GIL) Python builds, and fixes several issues in onig compilation and training logic.

⚠️ Breaking Changes

  • Fix training with special tokens: the way special tokens are handled during tokenizer training may have changed.

Migration Steps

  1. Review your training logic if you rely on specific handling of special tokens during tokenizer training; this behavior changed in PR #1617.

✨ New Features

  • Replaced lazy_static with the now-stabilized std::sync::LazyLock as a performance optimization.
  • Broadened free-threaded (no-GIL) Python support by updating the pyo3 and rust-numpy dependencies.
  • Added throughput measurement to benchmarks for more consistent performance evaluation.
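
For readers unfamiliar with the lazy_static replacement mentioned above, the following is a minimal sketch of the general pattern using only the standard library (Rust 1.80 or later). The table name `UPPERCASE_TABLE` and the helper `lookup` are hypothetical and chosen for illustration; they are not taken from the tokenizers source.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

// Hypothetical lookup table (illustration only, not from the tokenizers crate).
// With lazy_static this would have been declared inside a `lazy_static! { ... }` block.
static UPPERCASE_TABLE: LazyLock<HashMap<u8, char>> = LazyLock::new(|| {
    // The closure runs once, on first access, even if several threads race to read it.
    (b'a'..=b'z')
        .map(|b| (b, (b as char).to_ascii_uppercase()))
        .collect()
});

// Hypothetical helper: returns the uppercase char for a lowercase ASCII byte.
fn lookup(b: u8) -> Option<char> {
    UPPERCASE_TABLE.get(&b).copied()
}

fn main() {
    println!("{:?}", lookup(b'a'));
    println!("{:?}", lookup(b'0'));
}
```

Because `LazyLock` lives in the standard library, a static declared this way needs no macro and no extra dependency; call sites that read the static do not need to change.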

🐛 Bug Fixes

  • Fixed no-onig no-wasm builds.
  • Fixed typos in strings and comments.
  • Fixed type notation of merges in BPE Python binding.
  • Fixed data path in test_continuing_prefix_trainer_mismatch.
  • Fixed features blending into a paragraph.
  • Fixed Length Pre-Tokenizer issues.
  • Consolidated optimizations for ahash, dary, and compact str.

🔧 Affected Symbols

  • BPE Python binding (merges type notation)
  • continuing_prefix_trainer_mismatch (test data path)
  • from_pretrained function (now uses ApiBuilder::from_env())