0.16.16

📅 Jan 27, 2025 · 📦 unstructured

✨ 1 feature · 🐛 2 fixes · 🔧 2 symbols

Summary

This release introduces vectorized data structures for layout processing to improve performance. It also adds automatic downloading of NLTK data and corrects a pdfminer patch that caused token-splitting errors.

✨ New Features

Vectorized the layout data structures (inferred, extracted, and OCR): a group of layout elements or text regions is now stored in an np.ndarray instead of a list of objects, which improves memory efficiency and compute speed in layout merging and deduplication.
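
To illustrate why an array-backed layout helps with deduplication, here is a minimal sketch. It is not the library's implementation: the function names `pairwise_iou` and `drop_duplicates`, the `(x1, y1, x2, y2)` column order, and the 0.9 threshold are all assumptions made for this example. It shows how every pair of boxes can be compared in one broadcast operation instead of a nested Python loop over objects.

```python
import numpy as np


def pairwise_iou(boxes: np.ndarray) -> np.ndarray:
    """Return the N x N intersection-over-union matrix for N boxes.

    `boxes` has shape (N, 4) with columns x1, y1, x2, y2 (assumed layout).
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)

    # Broadcasting a column against a row compares every box with every other.
    ix1 = np.maximum(x1[:, None], x1[None, :])
    iy1 = np.maximum(y1[:, None], y1[None, :])
    ix2 = np.minimum(x2[:, None], x2[None, :])
    iy2 = np.minimum(y2[:, None], y2[None, :])

    # Negative widths or heights mean no overlap, so clip them to zero.
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    union = areas[:, None] + areas[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


def drop_duplicates(boxes, threshold: float = 0.9) -> np.ndarray:
    """Drop each box that overlaps an earlier box by more than `threshold`.

    Simplification: a box is compared against all earlier boxes, including
    ones that are themselves dropped.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    iou = pairwise_iou(boxes)
    earlier = np.triu(iou, k=1)  # keep only entries [i, j] with i < j
    is_duplicate = (earlier > threshold).any(axis=0)
    return boxes[~is_duplicate]


regions = [
    [0, 0, 10, 10],
    [0, 0, 10, 10.5],   # near-copy of the first region
    [20, 20, 30, 30],
]
unique_regions = drop_duplicates(regions)
```

The second region overlaps the first with an IoU of 100/105, so it is dropped and two regions remain. The comparison matrix grows with the square of the number of regions, which is usually acceptable for the element counts on a single page.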

🐛 Bug Fixes

Added auto-download of NLTK data when a user imports tokenize, controlled by the AUTO_DOWNLOAD_NLTK flag in tokenize.py.
Corrected the patch applied to pdfminer to prevent unnecessary token splitting in content streams. The splitting previously caused PDFSyntaxError and sometimes led to failed PDF repair and an unnecessary OCR fallback.
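
The flag-gated download in the first fix follows a common pattern, sketched below. This is not the code in tokenize.py: the helpers `env_flag` and `ensure_resources` are hypothetical, and reading the flag from an environment variable of the same name is an assumption, since the release notes only say the flag lives in tokenize.py. The check and download steps are passed in as callables so the sketch runs without NLTK installed. With NLTK, the check could wrap `nltk.data.find`, which raises `LookupError` for a missing resource, and the download step could be `nltk.download`.

```python
import os
from typing import Callable, Iterable, List


def env_flag(name: str, default: bool = True) -> bool:
    """Read a boolean flag from the environment (assumed source of the flag)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes"}


def ensure_resources(
    resources: Iterable[str],
    is_installed: Callable[[str], bool],
    download: Callable[[str], object],
    auto_download: bool,
) -> List[str]:
    """Download missing resources if allowed; return the ones downloaded."""
    missing = [name for name in resources if not is_installed(name)]
    if not missing:
        return []
    if not auto_download:
        # Fail with a clear message instead of fetching data silently.
        raise LookupError(f"missing resources, auto-download disabled: {missing}")
    for name in missing:
        download(name)
    return missing


# Hypothetical usage with an in-memory stand-in for the data directory.
installed = {"resource_a"}
fetched = ensure_resources(
    ["resource_a", "resource_b"],
    is_installed=lambda name: name in installed,
    download=installed.add,
    auto_download=env_flag("AUTO_DOWNLOAD_NLTK"),
)
```

Keeping the decision behind a single flag lets offline or sandboxed deployments disable network access and get an explicit error when data is missing.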

🔧 Affected Symbols

tokenize.py
pdfminer