0.17.11-dev1
📦 unstructured
⚠ 1 breaking · ✨ 5 features · 🐛 11 fixes · ⚡ 1 deprecation · 🔧 3 symbols
Summary
This release drops support for Python 3.9 due to dependency conflicts and includes numerous bug fixes related to PDF parsing, stability, and security vulnerabilities. It also deprecates the `stage_for_label_studio` utility.
⚠️ Breaking Changes
- Dropped support for Python 3.9 due to dependency conflicts. Users must upgrade to Python 3.10 or newer.
Migration Steps
- If you are using Python 3.9, you must upgrade your environment to Python 3.10 or higher to continue using this version.
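Before upgrading, it can help to confirm which interpreter an environment actually runs. The following is a minimal sketch; the helper name `python_is_supported` is hypothetical and not part of `unstructured`, and the only fact taken from the release note is the 3.10 minimum.

```python
import sys

# Minimum interpreter version stated in the release note.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the (major, minor) version meets the 3.10 minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_is_supported():
        print(f"Python {current} meets the 3.10 minimum.")
    else:
        print(f"Python {current} is too old; upgrade to 3.10 or newer.")
```

Running this in CI or a container entrypoint surfaces an outdated interpreter before dependency installation fails.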
✨ New Features
- File prefix matching is now used to verify the presence of DOCX, PPTX, and XLSX files instead of relying on standard file names.
- Added convenience update for `unstructured-get-json.sh`.
- Added option to change the default output directory for `unstructured-get-json.sh`.
- Bumped the inference models to a newer version (the exact version is not stated).
- Recompiled on arm64 to meet minimum requirements.
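The prefix-matching idea in the first item can be illustrated with a small sketch. It is not the library's implementation: the function name `detect_ooxml_kind` and the prefix table are assumptions, based on the conventional top-level folders (`word/`, `ppt/`, `xl/`) found inside Office Open XML archives.

```python
import io
import zipfile

# Assumed prefixes: the conventional top-level folders in OOXML packages.
OOXML_PREFIXES = {"docx": "word/", "pptx": "ppt/", "xlsx": "xl/"}


def detect_ooxml_kind(data: bytes):
    """Return "docx", "pptx", "xlsx", or None, based on archive member prefixes."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
    # Match on the folder prefix rather than one exact member name, so a
    # document stored as e.g. "word/document2.xml" is still recognized.
    for kind, prefix in OOXML_PREFIXES.items():
        if any(name.startswith(prefix) for name in names):
            return kind
    return None
```

Matching on a prefix tolerates files whose main part has a non-standard name, which an exact-name lookup would miss.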
🐛 Bug Fixes
- Hi-res PDF parsing now only extracts uncategorized text for extracted elements.
- Fixed nondeterministic ordering in `sort_page_element` so that sorting is now stable.
- Addressed CVEs found in dependencies.
- Fixed failing build related to missing `diffstat` command in the test_json_to_html CI job.
- Fixed a build failure.
- Properly handles cases where an element's text is None.
- Fixed a Pillow error encountered during PNG image extraction.
- Raises a validation error when the input JSON does not follow the unstructured JSON structure.
- Resolved logger library warnings.
- Fixed issue where chunking text resulted in an AttributeError ('NoneType' object has no attribute 'strip').
- Bumped the `requests` package to address CVEs.
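The two `None`-related fixes above concern the same failure mode: calling `.strip()` on text that is `None`. A minimal sketch of the defensive pattern follows; the helper names are hypothetical and this is not the library's actual code.

```python
from typing import Iterable, List, Optional


def normalized_text(text: Optional[str]) -> str:
    """Return stripped text, treating None as an empty string."""
    # `None.strip()` raises AttributeError; `(None or "")` yields "" instead.
    return (text or "").strip()


def non_empty_texts(texts: Iterable[Optional[str]]) -> List[str]:
    """Drop entries that are None or whitespace-only, e.g. before chunking."""
    return [t for t in (normalized_text(x) for x in texts) if t]
```

Callers working with versions before this fix could apply a similar guard to element text before passing it to chunking.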
🔧 Affected Symbols
- `stage_for_label_studio`
- `sort_page_element`
- `pdfminer_utils.py`
⚡ Deprecations
- The `stage_for_label_studio` function is deprecated.
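To find call sites that rely on deprecated functions, one option is to record warnings during a test run. The sketch below assumes the deprecation is signalled through Python's standard `warnings` module, which the release note does not confirm; `old_helper` is a hypothetical stand-in, not `stage_for_label_studio` itself.

```python
import warnings


def old_helper(value):
    """Hypothetical stand-in for a deprecated function."""
    warnings.warn("old_helper is deprecated", DeprecationWarning, stacklevel=2)
    return value


def calls_deprecated_api(func, *args, **kwargs) -> bool:
    """Return True if calling func emits a DeprecationWarning."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones hidden by default filters.
        warnings.simplefilter("always")
        func(*args, **kwargs)
    return any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Wrapping suspect code paths this way in a test suite gives an early list of usages to migrate before the function is removed.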