0.17.11-dev1

Breaking Changes

📅 Jun 13, 2025 · 📦 unstructured

⚠ 1 breaking · ✨ 5 features · 🐛 11 fixes · ⚡ 1 deprecation · 🔧 3 symbols

Summary

This release drops support for Python 3.9 due to dependency conflicts and includes numerous bug fixes related to PDF parsing, stability, and security vulnerabilities. It also deprecates the `stage_for_label_studio` utility.

⚠️ Breaking Changes

Dropped support for Python 3.9 due to dependency conflicts. Users must upgrade to Python 3.10 or newer.

Migration Steps

If your environment runs Python 3.9, upgrade it to Python 3.10 or newer before installing this version.
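
A project that depends on this release could check the interpreter version at startup. The following is a minimal sketch and not part of unstructured; `MIN_PYTHON` and `is_supported` are hypothetical names introduced for illustration.

```python
import sys

# Minimum interpreter version required by this release, per the notes above.
MIN_PYTHON = (3, 10)


def is_supported(version_info=None):
    """Return True if the given (or current) interpreter meets MIN_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); tuples compare element by element.
    return tuple(version_info[:2]) >= MIN_PYTHON
```

Calling `is_supported()` with no argument checks the running interpreter, so an application can fail early with a clear message instead of hitting an install-time or import-time error.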

✨ New Features

File prefix matching is now used to verify the presence of DOCX, PPTX, and XLSX files instead of relying on standard file names.
Added a convenience update to `unstructured-get-json.sh`.
Added option to change the default output directory for `unstructured-get-json.sh`.
Bumped the inference models to a newer version.
Recompiled on arm64 to meet minimum requirements.
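
One way to read the prefix-matching change: DOCX, PPTX, and XLSX files are ZIP archives that conventionally keep their parts under `word/`, `ppt/`, and `xl/` respectively. The sketch below assumes that this is the kind of prefix being matched. It is not the library's implementation, and `detect_ooxml_type` and `_PREFIXES` are hypothetical names.

```python
import io
import zipfile

# Conventional top-level directories for each format (an assumed mapping).
_PREFIXES = {"word/": "docx", "ppt/": "pptx", "xl/": "xlsx"}


def detect_ooxml_type(data: bytes):
    """Return 'docx', 'pptx', 'xlsx', or None based on archive member prefixes."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            for prefix, kind in _PREFIXES.items():
                # Match on the directory prefix rather than an exact name such as
                # "word/document.xml", so nonstandard part names are still detected.
                if name.startswith(prefix):
                    return kind
    return None
```

Matching on a prefix tolerates archives whose main part has an unusual file name, which an exact-name check would miss.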

🐛 Bug Fixes

Hi-res PDF parsing now only extracts uncategorized text for extracted elements.
Fixed unstable (effectively random) ordering in `sort_page_element` so that sorting is now stable.
Addressed CVEs found in dependencies.
Fixed a failing build caused by the missing `diffstat` command in the `test_json_to_html` CI job.
Fixed a build failure.
Elements whose text is `None` are now handled properly.
Fixed a Pillow error encountered during PNG image extraction.
A validation error is now raised when the input JSON does not follow the unstructured JSON structure.
Resolved logger library warnings.
Fixed an issue where chunking text raised `AttributeError: 'NoneType' object has no attribute 'strip'`.
Bumped the `requests` package to address CVEs.
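
The two `None`-related fixes in this list describe a common failure mode: calling `.strip()` on an element whose text is `None`. A minimal sketch of the guard follows. The `Element` class and both helper functions are hypothetical stand-ins, not the library's own types or code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a document element; not the library's class."""
    text: Optional[str] = None


def clean_text(element: Element) -> str:
    """Return the element's stripped text, treating None as an empty string."""
    # Calling .strip() on None raises AttributeError, so fall back to "" first.
    return (element.text or "").strip()


def non_empty_texts(elements):
    """Collect stripped text from elements, skipping those with no content."""
    return [t for t in map(clean_text, elements) if t]
```

Code that post-processes elements from earlier versions may want a similar guard if it assumes `text` is always a string.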

Affected Symbols

`stage_for_label_studio`, `sort_page_element`, `pdfminer_utils.py`

⚡ Deprecations

The `stage_for_label_studio` function is deprecated.
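
To locate remaining call sites before a deprecated function is removed, one option is to record deprecation warnings during a test run. The sketch below uses only the standard `warnings` module. `legacy_stage` is a hypothetical stand-in for a deprecated function, and it is an assumption that the library signals this deprecation with a `DeprecationWarning`.

```python
import warnings


def legacy_stage(items):
    """Hypothetical deprecated function standing in for a real call site."""
    warnings.warn("legacy_stage is deprecated", DeprecationWarning, stacklevel=2)
    return list(items)


def collect_deprecations(func, *args, **kwargs):
    """Call func and return (result, list of deprecation messages it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure DeprecationWarning is not suppressed by the default filters.
        warnings.simplefilter("always", DeprecationWarning)
        result = func(*args, **kwargs)
    messages = [
        str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)
    ]
    return result, messages
```

Wrapping a suspect code path in `collect_deprecations` returns its normal result alongside any deprecation messages, which makes the affected calls easy to list.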