v0.22.2

📦 tokenizers
✨ 2 features · 🐛 3 fixes · 🔧 1 symbol

Summary

This release focuses on performance: loading a vocabulary with many added tokens is now 4x to 8x faster thanks to GIL-free operations. It also brings typing improvements and bug fixes.

Migration Steps

  1. If you rely on internal behavior of added-token deserialization or of normalization, review the changes introduced in PR #1891 and PR #1884.

✨ New Features

  • Improved typing support.
  • Vocabulary loading is 4x to 8x faster when many added tokens are present, now using GIL-free operations.
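
To check the loading speedup on your own setup, something like the following rough timing sketch may help. It assumes the `tokenizers` Python package is installed. The toy `WordLevel` vocabulary, the `<extra_N>` token names, and the count of 5,000 are illustrative choices, not taken from the release. Timings vary by machine and version, so no expected figure is given.

```python
import time

from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Toy tokenizer with a one-entry vocabulary (illustrative only).
tokenizer = Tokenizer(WordLevel({"[UNK]": 0}, unk_token="[UNK]"))

# Register many added tokens; the names are hypothetical placeholders.
num_added = tokenizer.add_tokens([f"<extra_{i}>" for i in range(5_000)])

# Serialize to JSON, then time how long it takes to load it back.
serialized = tokenizer.to_str()
start = time.perf_counter()
reloaded = Tokenizer.from_str(serialized)
elapsed = time.perf_counter() - start

print(f"added {num_added} tokens, reloaded in {elapsed:.4f}s")
```

Running the same script before and after upgrading gives a rough comparison; a real tokenizer file with its own added tokens would be a more representative test.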

🐛 Bug Fixes

  • Fixed deserialization of added tokens.
  • Ensured `normalize_str` is used within `BaseTokenizer.normalize`.
  • Removed runtime stderr warning from Python bindings.
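
For context on the `normalize_str` fix, here is a minimal sketch of what `normalize_str` does on a normalizer, assuming the `tokenizers` Python package is installed. The normalizer pipeline and the sample string are illustrative assumptions. According to the note above, `BaseTokenizer.normalize` now goes through this same method, though the sketch calls the normalizer directly rather than reproducing the library's internal code.

```python
from tokenizers import normalizers

# Illustrative pipeline: decompose, lowercase, then drop accents.
normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

# normalize_str applies the pipeline to a plain Python string.
result = normalizer.normalize_str("Héllo")
print(result)  # hello
```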

🔧 Affected Symbols

BaseTokenizer.normalize