
0.16.13

Breaking Changes
📦 unstructured
1 breaking change · 1 feature · 1 fix · 1 affected symbol

Summary

This release adds character-level confidence filtering for Tesseract output and fixes how NLTK assets are used in Docker images. It also removes automatic downloading of NLTK packages, which is a breaking change.

⚠️ Breaking Changes

  • Removed automatic downloading of NLTK packages if missing. Users must now ensure NLTK data is present before use.

Migration Steps

  1. If you relied on automatic NLTK package downloads, download the required NLTK data yourself or make sure it is already present in the environment before use (for example, by baking it into your Docker images).

✨ New Features

  • Added character-level filtering for Tesseract output, configurable via the TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD environment variable.
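
To illustrate the idea of character-level filtering, here is a minimal sketch that reads the threshold from the environment and drops low-confidence characters. Only the environment variable name comes from the release notes. The functions `read_threshold` and `filter_characters`, the 0.0 fallback, the 0 to 1 confidence scale, and the `>=` comparison are all assumptions made for illustration and may not match how unstructured implements the filter.

```python
import os
from typing import Iterable, Tuple

ENV_VAR = "TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD"


def read_threshold(default: float = 0.0) -> float:
    """Read the threshold from the environment (hypothetical helper).

    The 0.0 fallback is an assumption; it keeps every character.
    """
    raw = os.environ.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return default
    return float(raw)


def filter_characters(chars: Iterable[Tuple[str, float]], threshold: float) -> str:
    """Keep characters whose confidence is at least `threshold`.

    `chars` is a sequence of (character, confidence) pairs. The 0 to 1
    scale and the >= comparison are assumptions for this sketch.
    """
    return "".join(ch for ch, conf in chars if conf >= threshold)


# Illustrative, made-up OCR output: the stray "|" has low confidence.
os.environ[ENV_VAR] = "0.5"
ocr_chars = [("H", 0.98), ("e", 0.91), ("|", 0.12), ("y", 0.87)]
filtered = filter_characters(ocr_chars, read_threshold())
print(filtered)
```

Set the variable in the process environment (or in your Dockerfile or deployment config) before the OCR step runs so the threshold is picked up.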

🐛 Bug Fixes

  • Fixed an NLTK download issue so that the NLTK assets in Docker images are used correctly.

🔧 Affected Symbols

TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD