0.16.13
📦 unstructured
⚠ 1 breaking · ✨ 1 feature · 🐛 1 fix · 🔧 1 symbol
Summary
This release introduces character-level filtering for Tesseract output and resolves an issue with NLTK asset usage in Docker images, while removing automatic NLTK package downloading.
⚠️ Breaking Changes
- Removed automatic downloading of NLTK packages if missing. Users must now ensure NLTK data is present before use.
Migration Steps
- If relying on automatic NLTK package downloads, ensure necessary NLTK data is manually downloaded or present in the environment (especially in Docker images).
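One way to verify that the NLTK data is present before use is sketched below. It is a minimal illustration, not the library's own check. The resource names in `ASSUMED_RESOURCES` are an assumption about what the package needs and should be confirmed against your installed version; `missing_nltk_resources` and `check_nltk_data` are hypothetical helper names. The only NLTK call used is `nltk.data.find`, which raises `LookupError` when a resource is absent.

```python
from __future__ import annotations

from typing import Callable, Iterable

# Assumption: these resource paths are illustrative; confirm the exact
# list against the unstructured version you have installed.
ASSUMED_RESOURCES = (
    "tokenizers/punkt_tab",
    "taggers/averaged_perceptron_tagger_eng",
)


def missing_nltk_resources(
    resources: Iterable[str], find: Callable[[str], object]
) -> list[str]:
    """Return the resources for which `find` raises LookupError."""
    missing = []
    for resource in resources:
        try:
            find(resource)
        except LookupError:
            missing.append(resource)
    return missing


def check_nltk_data() -> None:
    """Fail fast if NLTK data is absent (hypothetical helper)."""
    import nltk  # imported lazily so the helper above has no hard dependency

    missing = missing_nltk_resources(ASSUMED_RESOURCES, nltk.data.find)
    if missing:
        raise RuntimeError(f"Missing NLTK data: {missing}")
```

Calling `check_nltk_data()` at application startup surfaces a missing resource immediately. In a Docker image, the data can be fetched at build time, for example with `nltk.download("punkt_tab")`, so that nothing needs to be downloaded at runtime.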
✨ New Features
- Added character-level filtering for Tesseract output, configurable via the TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD environment variable.
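The idea behind character-level filtering can be sketched as follows. This is an illustration of the concept only, not the library's implementation. The helper names, the default of 0.0, and the assumption that confidences are floats between 0 and 1 are all hypothetical; only the environment variable name comes from the release notes.

```python
import os

ENV_VAR = "TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD"


def read_threshold(default: float = 0.0) -> float:
    """Read the threshold from the environment (default is an assumption)."""
    raw = os.environ.get(ENV_VAR)
    return float(raw) if raw not in (None, "") else default


def filter_characters(chars, threshold: float) -> str:
    """Keep characters whose confidence meets the threshold.

    `chars` is an iterable of (character, confidence) pairs; the 0-1
    confidence scale is assumed for illustration.
    """
    return "".join(ch for ch, conf in chars if conf >= threshold)


# Hypothetical OCR output: the low-confidence "~" is dropped at 0.5.
sample = [("H", 0.9), ("i", 0.8), ("~", 0.2)]
os.environ[ENV_VAR] = "0.5"
filtered = filter_characters(sample, read_threshold())
```

In practice the variable would be set in the process environment (shell, Dockerfile, or deployment config) before the OCR step runs, rather than from inside the program.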
🐛 Bug Fixes
- Fixed an NLTK download issue so that NLTK assets are used correctly within Docker images.
🔧 Affected Symbols
TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD