0.16.16
📦 unstructured
✨ 1 feature · 🐛 2 fixes · 🔧 2 symbols
Summary
This release introduces vectorized data structures for layout processing to improve performance. It also adds automatic NLTK data download when tokenize is imported and corrects a pdfminer patch that caused unnecessary token splitting.
✨ New Features
- Vectorized the layout (inferred, extracted, and OCR) data structures: a group of layout elements or text regions is now stored in an np.ndarray instead of a list of objects, improving memory efficiency and compute speed for layout merging and deduplication.
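
To illustrate why an array-backed layout helps with deduplication, here is a minimal sketch. It is not the library's implementation: the `(x1, y1, x2, y2)` row format, the function names `pairwise_iou` and `deduplicate`, and the 0.8 threshold are all assumptions made for this example.

```python
import numpy as np

# Hypothetical layout: each row is one region as (x1, y1, x2, y2).
boxes = np.array(
    [
        [0, 0, 10, 10],
        [1, 1, 10, 10],    # near-duplicate of the first box
        [20, 20, 30, 30],
    ],
    dtype=float,
)


def pairwise_iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return a (len(a), len(b)) matrix of intersection-over-union values."""
    # Broadcasting compares every box in `a` with every box in `b` at once,
    # replacing a nested Python loop over objects.
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / union


def deduplicate(regions: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Drop any region whose IoU with an earlier region exceeds `threshold`."""
    iou = pairwise_iou(regions, regions)
    # Keep only pairs (i, j) with i < j, then flag column j if any earlier
    # row i overlaps it heavily.
    is_duplicate = np.triu(iou > threshold, k=1).any(axis=0)
    return regions[~is_duplicate]


unique_boxes = deduplicate(boxes)
```

With the sample data, the second box overlaps the first heavily enough to be dropped, so `unique_boxes` keeps the first and third rows.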
🐛 Bug Fixes
- Added auto-download of NLTK data when a user imports tokenize, controlled by the AUTO_DOWNLOAD_NLTK flag in tokenize.py.
- Corrected the patch applied to pdfminer so that it no longer splits tokens unnecessarily in content streams. The previous behavior caused PDFSyntaxError and sometimes led to failed PDF repair and an unnecessary OCR fallback.
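
The flag-controlled download described in the first fix can be sketched as follows. This is an illustrative pattern only, not the code in tokenize.py: reading the flag from an environment variable, the helper name `ensure_resources`, and the injected `is_available` and `download` callables are assumptions for this example.

```python
import os
from typing import Callable, Iterable, List

# Assumption: the flag is read from an environment variable of the same name.
AUTO_DOWNLOAD_NLTK = os.getenv("AUTO_DOWNLOAD_NLTK", "true").lower() == "true"


def ensure_resources(
    names: Iterable[str],
    is_available: Callable[[str], bool],
    download: Callable[[str], object],
    enabled: bool = AUTO_DOWNLOAD_NLTK,
) -> List[str]:
    """Download each missing resource when enabled; return the names fetched."""
    fetched: List[str] = []
    if not enabled:
        # Auto-download is switched off, so leave the environment untouched.
        return fetched
    for name in names:
        if not is_available(name):
            download(name)
            fetched.append(name)
    return fetched
```

With NLTK, `is_available` could wrap `nltk.data.find` (which raises `LookupError` when a resource is missing) and `download` could be `nltk.download`. Calling the helper at module import time would give the on-import behavior described above.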
🔧 Affected Symbols
- tokenize.py
- pdfminer