0.16.16

📦 unstructured
1 feature · 🐛 2 fixes · 🔧 2 symbols

Summary

This release vectorizes the layout data structures to speed up layout processing, adds automatic download of NLTK data when tokenize is imported, and corrects a pdfminer patch that split tokens unnecessarily in content streams and caused PDFSyntaxError.

✨ New Features

  • Vectorized the layout data structures (inferred, extracted, and OCR): a group of layout elements or text regions is now stored in an np.ndarray instead of a list of objects, which improves memory efficiency and compute speed for layout merging and deduplication.
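
To illustrate why an array-backed layout helps with deduplication, here is a minimal sketch. It assumes boxes are stored as an (N, 4) array of x1, y1, x2, y2 coordinates; the `BoxArray` class, its methods, and the 0.9 threshold are hypothetical names and values chosen for this example, not the library's actual API.

```python
import numpy as np


class BoxArray:
    """Hypothetical array-backed group of text regions (illustration only)."""

    def __init__(self, coords, texts):
        # One row per region: x1, y1, x2, y2.
        self.coords = np.asarray(coords, dtype=np.float64).reshape(-1, 4)
        self.texts = np.asarray(texts, dtype=object)

    def areas(self):
        width = self.coords[:, 2] - self.coords[:, 0]
        height = self.coords[:, 3] - self.coords[:, 1]
        return np.clip(width, 0, None) * np.clip(height, 0, None)

    def pairwise_iou(self):
        # Broadcasting computes all N x N intersections without a Python loop.
        c = self.coords
        x1 = np.maximum(c[:, None, 0], c[None, :, 0])
        y1 = np.maximum(c[:, None, 1], c[None, :, 1])
        x2 = np.minimum(c[:, None, 2], c[None, :, 2])
        y2 = np.minimum(c[:, None, 3], c[None, :, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = self.areas()
        union = area[:, None] + area[None, :] - inter
        return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

    def deduplicate(self, threshold=0.9):
        # A region is dropped when any earlier region overlaps it above the threshold.
        overlaps = self.pairwise_iou() > threshold
        is_duplicate = np.triu(overlaps, k=1).any(axis=0)
        keep = ~is_duplicate
        return BoxArray(self.coords[keep], self.texts[keep])


regions = BoxArray(
    [[0, 0, 10, 10], [0, 0, 10, 10.5], [20, 20, 30, 30]],
    ["title", "title (OCR)", "body"],
)
unique = regions.deduplicate(threshold=0.9)
print(list(unique.texts))
```

With a list of objects, the same comparison would need a nested Python loop over pairs of regions; the array form does the pairwise work in a handful of NumPy calls.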

🐛 Bug Fixes

  • Added automatic download of NLTK data when the user imports tokenize, controlled by the AUTO_DOWNLOAD_NLTK flag in tokenize.py.
  • Corrected the patch applied to pdfminer so that it no longer splits tokens unnecessarily in content streams. The splitting previously caused PDFSyntaxError and sometimes led to failed PDF repair and an unnecessary OCR fallback.
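
A flag-controlled download of the kind described in the first fix could look roughly like the sketch below. It is an assumption-laden illustration, not the code in tokenize.py: reading the flag from an environment variable, the `ensure_nltk_data` helper, and the resource names in `REQUIRED` are all hypothetical. Only `nltk.data.find` and `nltk.download` are real NLTK calls.

```python
import os

# Assumption: the flag is read from the environment and defaults to enabled.
AUTO_DOWNLOAD_NLTK = os.getenv("AUTO_DOWNLOAD_NLTK", "true").lower() == "true"

# Illustrative (category, resource) pairs; the real list may differ.
REQUIRED = (
    ("tokenizers", "punkt_tab"),
    ("taggers", "averaged_perceptron_tagger_eng"),
)


def ensure_nltk_data(auto_download=AUTO_DOWNLOAD_NLTK, find=None, download=None):
    """Download missing NLTK resources if allowed; return the names fetched."""
    if find is None or download is None:
        # Imported lazily so the helper can be exercised with stand-in callables.
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for category, name in REQUIRED:
        try:
            find(f"{category}/{name}")  # raises LookupError when the data is absent
        except LookupError:
            if not auto_download:
                raise
            download(name, quiet=True)
            fetched.append(name)
    return fetched
```

Calling `ensure_nltk_data()` at import time would reproduce the "download on import" behaviour, while setting the flag to false would surface the original `LookupError` instead, which may be preferable in offline or locked-down environments.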

🔧 Affected Symbols

  • tokenize.py
  • pdfminer