v0.23.1

Breaking Changes
📦 tokenizers
5 breaking · 5 features · 🐛 3 fixes · 🔧 8 symbols

Summary

Version 0.23.1 is the first stable release in the 0.23 line, featuring massive performance gains in vocabulary loading, full Python type hints, and stable multi-platform Node.js bindings. Python 3.9 support has been dropped.

⚠️ Breaking Changes

  • Python 3.9 support has been dropped; users must use Python >=3.10.
  • The 'content' field in the 'added_tokens' block of 'tokenizer.json' is now normalized when tokens are inserted via 'add_tokens'. Re-saved files may differ in this block.
  • Type stubs are now precise. Methods previously returning 'Any' now return concrete types, which may cause 'mypy --strict' to surface new errors.
  • The stub layout has changed from 'tokenizers/submodule/__init__.pyi' to 'tokenizers/submodule.pyi', potentially breaking imports relying on the old structure (e.g., 'RobertaProcessing.__init__').
  • On Python 3.14t (free-threaded), setters/getters now return 'PyResult<T>' because the underlying state uses 'Arc<RwLock<Tokenizer>>'. A poisoned lock surfaces as a 'PyException' instead of causing a panic.

Migration Steps

  1. Upgrade to Python 3.10 or newer.
  2. If using 'add_tokens', be aware that re-saving 'tokenizer.json' will reflect normalized content in the 'added_tokens' block.
  3. If using strict type checking (e.g., 'mypy --strict'), review code for newly surfaced type errors due to precise type hints.
  4. If relying on the internal stub structure, update references to submodule stubs (e.g., 'tokenizers/processors.pyi' instead of 'tokenizers/processors/__init__.pyi').
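
Steps 1 and 2 can be checked mechanically. The sketch below uses only the standard library: it tests the interpreter version and lists the 'added_tokens' entries whose 'content' differs between an original and a re-saved 'tokenizer.json'. The helper names are illustrative and not part of tokenizers. It assumes 'added_tokens' is a list of objects carrying 'id' and 'content' keys.

```python
import json
import sys
from pathlib import Path


def check_python_version(minimum=(3, 10)):
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def load_tokenizer_json(path):
    """Read a tokenizer.json file into a plain dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def changed_added_tokens(old_doc, new_doc):
    """Map token id -> (old content, new content) for entries that differ.

    Assumption: each entry in 'added_tokens' has 'id' and 'content' keys.
    """
    old = {t["id"]: t["content"] for t in old_doc.get("added_tokens", [])}
    new = {t["id"]: t["content"] for t in new_doc.get("added_tokens", [])}
    return {
        token_id: (old[token_id], new[token_id])
        for token_id in old.keys() & new.keys()
        if old[token_id] != new[token_id]
    }
```

To use it, load the file saved by the previous release and the file re-saved by 0.23.1 with 'load_tokenizer_json', then pass both to 'changed_added_tokens'. An empty result means the normalization change did not alter any added token in that file.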

✨ New Features

  • Prebuilt Node.js binaries for 13 platforms are shipped for the first time since 2023.
  • Support for Python 3.14 (regular and free-threaded '3.14t') added.
  • Full type hints added for every Python class (Tokenizer, AddedToken, Encoding, etc.).
  • Unigram sampling now exposes 'alpha' and 'nbest_size' parameters in 'models.Unigram' for subword regularization.
  • Weakref support added for the 'Tokenizer' class.
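
Weakref support means a 'Tokenizer' can be the target of 'weakref.ref' or a value in a 'weakref.WeakValueDictionary'. The sketch below shows a cache built on that idea. 'WeakCache' and 'StandInTokenizer' are hypothetical names. The stand-in class exists only so the snippet runs without tokenizers installed; per the note above, a real 'Tokenizer' instance from 0.23.1 should be usable in its place.

```python
import gc
import weakref


class StandInTokenizer:
    """Hypothetical placeholder for tokenizers.Tokenizer (not the real class)."""

    def __init__(self, name):
        self.name = name


class WeakCache:
    """Cache that does not keep its values alive on its own."""

    def __init__(self, factory):
        self._factory = factory
        self._items = weakref.WeakValueDictionary()

    def get(self, key):
        obj = self._items.get(key)
        if obj is None:
            # Build the object once; the dict holds only a weak reference to it.
            obj = self._factory(key)
            self._items[key] = obj
        return obj

    def __len__(self):
        return len(self._items)


cache = WeakCache(StandInTokenizer)
tok = cache.get("bert-base")
assert cache.get("bert-base") is tok  # same object while a strong reference exists
del tok
gc.collect()  # the entry disappears once nothing else holds the object
```

The design choice here is that the cache never extends an object's lifetime: callers that still hold the tokenizer share one instance, and memory is released as soon as the last caller drops it.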

🐛 Bug Fixes

  • Fixed Node.js build pipeline to ship multi-platform binaries correctly, resolving 'package-not-found' errors for many users since 2023.
  • Fixed an issue where importing 'tokenizers' on free-threaded Python 3.14t would force the GIL back on; the module now correctly declares 'Py_MOD_GIL_NOT_USED'.
  • Fixed concurrent access issues on 3.14t by using 'Arc<RwLock<Tokenizer>>' internally, preventing races between setters and encoders.
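
One way to confirm the GIL fix on a free-threaded build is to ask the interpreter after the import. The sketch below relies on 'sys._is_gil_enabled()', which CPython provides from 3.13 onward; on older interpreters the helper assumes the GIL is on. Both function names are illustrative and not part of tokenizers.

```python
import importlib
import sys


def gil_enabled():
    """Return True if the GIL is active; interpreters before 3.13 always have it."""
    probe = getattr(sys, "_is_gil_enabled", None)
    return True if probe is None else bool(probe())


def gil_enabled_after_import(module_name):
    """Import a module, then report whether the GIL is active afterwards."""
    importlib.import_module(module_name)
    return gil_enabled()
```

On Python 3.14t with 0.23.1 installed, 'gil_enabled_after_import("tokenizers")' would be expected to return False, assuming no other loaded extension re-enables the GIL and the 'PYTHON_GIL' environment variable is not set to 1.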

Affected Symbols