
0.17.11-dev1

Breaking Changes
📦 unstructured
1 breaking · 5 features · 11 fixes · 1 deprecation · 3 symbols

Summary

This release drops support for Python 3.9 due to dependency conflicts and includes numerous bug fixes related to PDF parsing, stability, and security vulnerabilities. It also deprecates the `stage_for_label_studio` utility.

⚠️ Breaking Changes

  • Dropped support for Python 3.9 due to dependency conflicts. Users must upgrade to Python 3.10 or newer.

Migration Steps

  1. If you are using Python 3.9, you must upgrade your environment to Python 3.10 or higher to continue using this version.
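A project that depends on this version may want to fail fast when it is started on an unsupported interpreter. The following is a minimal sketch of such a guard, not part of unstructured itself; the names `MIN_PYTHON`, `is_supported`, and `require_supported_python` are hypothetical.

```python
import sys

# Minimum interpreter version for this release (Python 3.9 is no longer supported).
MIN_PYTHON = (3, 10)


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def require_supported_python() -> None:
    """Raise RuntimeError when the running interpreter is too old."""
    if not is_supported():
        raise RuntimeError(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
```

Calling `require_supported_python()` at the top of an entry point surfaces the version problem before any dependency import fails with a less obvious error.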

✨ New Features

  • File prefix matching is now used to verify the presence of DOCX, PPTX, and XLSX files instead of relying on standard file names.
  • Added convenience update for `unstructured-get-json.sh`.
  • Added option to change the default output directory for `unstructured-get-json.sh`.
  • Bumped the inference models to a newer version (the exact version is not stated).
  • Recompiled on arm64 to meet minimum requirements.
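To illustrate the prefix-matching idea from the first item above: DOCX, PPTX, and XLSX files are ZIP archives, so a format can be guessed from the names of the archive members. The sketch below is an assumption about how such a check could look, not the library's actual implementation; the prefix table and the function `detect_office_type` are hypothetical.

```python
import io
import zipfile
from typing import Optional

# Hypothetical prefix table: the main part of each Office format normally
# lives under a member name that starts with one of these prefixes.
PREFIX_TO_TYPE = {
    "word/document": "docx",
    "ppt/presentation": "pptx",
    "xl/workbook": "xlsx",
}


def detect_office_type(data: bytes) -> Optional[str]:
    """Guess the Office format of a ZIP archive from member-name prefixes."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            for prefix, file_type in PREFIX_TO_TYPE.items():
                # A prefix match accepts a non-standard name such as
                # "word/document2.xml" as well as "word/document.xml".
                if name.startswith(prefix):
                    return file_type
    return None
```

Matching on a prefix rather than on an exact member name means an archive whose main part carries a non-standard name is still recognized, which is the behavior the feature describes.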

🐛 Bug Fixes

  • Hi-res PDF parsing now only extracts uncategorized text for extracted elements.
  • Fixed nondeterministic ordering in `sort_page_element`; sorting is now stable.
  • Addressed CVEs found in dependencies.
  • Fixed failing build related to missing `diffstat` command in the test_json_to_html CI job.
  • Fixed a build failure.
  • Elements whose text is `None` are now handled properly.
  • Fixed a Pillow error encountered during PNG image extraction.
  • A validation error is now raised when the input JSON does not conform to the unstructured JSON structure.
  • Resolved logger library warnings.
  • Fixed an `AttributeError` (`'NoneType' object has no attribute 'strip'`) raised when chunking text.
  • Bumped the `requests` package to address CVEs.
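The two `None`-related fixes above concern a common failure mode: calling `.strip()` on a text value that is `None`. The snippet below is a minimal sketch of the usual guard, assuming a plain optional string; the helper name `clean_text` is hypothetical and does not come from the library.

```python
from typing import Optional


def clean_text(text: Optional[str]) -> str:
    """Return the stripped text, treating None as an empty string."""
    # None has no .strip() method, so substitute "" before stripping.
    return (text or "").strip()
```

Normalizing to an empty string at one boundary keeps downstream code, such as chunking, free of repeated `None` checks.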

🔧 Affected Symbols

  • `stage_for_label_studio`
  • `sort_page_element`
  • `pdfminer_utils.py`

⚡ Deprecations

  • The `stage_for_label_studio` function is deprecated.