tokenizers v0.23.1
Summary
Version 0.23.1 is the first stable release in the 0.23 line. It brings large performance gains in vocabulary loading, full Python type hints, and stable multi-platform Node.js bindings. Python 3.9 support has been dropped.
⚠️ Breaking Changes
- Python 3.9 support has been dropped; users must use Python >=3.10.
- The 'content' field in the 'added_tokens' block of 'tokenizer.json' is now normalized when tokens are inserted via 'add_tokens'. Re-saved files may differ in this block.
- Type stubs are now precise. Methods previously returning 'Any' now return concrete types, which may cause 'mypy --strict' to surface new errors.
- The stub layout has changed from 'tokenizers/submodule/__init__.pyi' to 'tokenizers/submodule.pyi', potentially breaking imports relying on the old structure (e.g., 'RobertaProcessing.__init__').
- On Python 3.14t (free-threaded), setters/getters now return 'PyResult<T>' because the underlying state uses 'Arc<RwLock<Tokenizer>>'. A poisoned lock surfaces as a 'PyException' instead of causing a panic.
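The Python 3.9 drop in the list above can be guarded against at application startup. The following is a minimal sketch using only the standard library; the helper name `python_is_supported` and the constant `MIN_PYTHON` are illustrative and not part of tokenizers.

```python
import sys

# Minimum interpreter version stated in these release notes.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info):
    """Return True if the given version meets the 3.10+ requirement.

    Hypothetical helper: accepts sys.version_info or any (major, minor, ...) tuple.
    """
    return tuple(version_info[:2]) >= MIN_PYTHON


def require_supported_python():
    """Raise a readable error instead of failing later at install or import time."""
    if not python_is_supported():
        found = ".".join(str(part) for part in sys.version_info[:3])
        raise RuntimeError(
            f"tokenizers 0.23.x needs Python >= 3.10, but this interpreter is {found}"
        )
```

Calling `require_supported_python()` near the top of an entry point is one way to surface the requirement early.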
Migration Steps
- Upgrade Python version to 3.10 or newer.
- If using 'add_tokens', be aware that re-saving 'tokenizer.json' will reflect normalized content in the 'added_tokens' block.
- If using strict type checking (e.g., 'mypy --strict'), review code for newly surfaced type errors due to precise type hints.
- If relying on internal stub structure, update references to submodule stubs (e.g., 'tokenizers/processors.pyi' instead of 'tokenizers/processors/__init__.pyi').
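To see how re-saving affects the 'added_tokens' block, one option is to compare the block in the old and new 'tokenizer.json' files. The sketch below uses only the standard library and assumes each entry in 'added_tokens' carries an 'id' and a 'content' key; the function names are illustrative, not part of tokenizers.

```python
import json


def load_json(path):
    """Read a tokenizer.json file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def added_token_contents(tokenizer_json):
    """Map added-token id -> content (assumes 'id' and 'content' keys per entry)."""
    return {
        entry["id"]: entry["content"]
        for entry in tokenizer_json.get("added_tokens", [])
    }


def diff_added_tokens(before, after):
    """Return (id, old_content, new_content) for every entry that changed.

    A missing entry on either side is reported with None for that side.
    """
    old = added_token_contents(before)
    new = added_token_contents(after)
    changes = []
    for token_id in sorted(old.keys() | new.keys()):
        if old.get(token_id) != new.get(token_id):
            changes.append((token_id, old.get(token_id), new.get(token_id)))
    return changes
```

With files on disk, `diff_added_tokens(load_json(old_path), load_json(new_path))` lists the entries whose content changed; an empty list means the block is unchanged.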
✨ New Features
- Full Node.js multi-platform prebuilt binaries shipped for the first time since 2023, supporting 13 platforms.
- Support for Python 3.14 (regular and free-threaded '3.14t') added.
- Full type hints added for every Python class (Tokenizer, AddedToken, Encoding, etc.).
- Unigram sampling now exposes 'alpha' and 'nbest_size' parameters in 'models.Unigram' for subword regularization.
- Weakref support added for the 'Tokenizer' class.
🐛 Bug Fixes
- Fixed Node.js build pipeline to ship multi-platform binaries correctly, resolving 'package-not-found' errors for many users since 2023.
- Fixed issue where importing 'tokenizers' on free-threaded Python 3.14t would force the GIL back on; it now correctly declares 'Py_MOD_GIL_NOT_USED'.
- Fixed concurrent access issues on 3.14t by using 'Arc<RwLock<Tokenizer>>' internally, preventing races between setters and encoders.
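To check the GIL-related fix on a free-threaded interpreter, it can help to inspect the GIL state before and after importing the package. This is a minimal standard-library sketch; `gil_status` is an illustrative helper name, and `sys._is_gil_enabled()` is only available on CPython 3.13 and later, so the code falls back to assuming the GIL is on when it is missing.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds (e.g. 3.14t), 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # Older interpreters lack the probe and always run with the GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = probe() if probe is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}
```

On a free-threaded build, comparing `gil_status()["gil_enabled"]` before and after `import tokenizers` should show whether the import switched the GIL back on.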