0.17.2
📦 unstructured
✨ 2 features · 🐛 1 fix · 🔧 2 symbols
Summary
This release extracts image URLs when partitioning HTML and speeds up hOCR parsing by switching from BeautifulSoup (bs4) to lxml. It also bumps the minimum required numpy version to >2.
Migration Steps
- hOCR data is now parsed with lxml instead of BeautifulSoup (bs4). If your workflow depends on hOCR parsing behavior or speed, re-test it after upgrading.
- Ensure your environment supports numpy version >2, as dependencies like paddlepaddle, unstructured-paddleocr, and onnx have been upgraded to maintain compatibility.
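One way to check the environment before upgrading is to list the installed versions of the packages named above. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `major_of` are hypothetical and not part of unstructured, and how to read ">2" for your own pins is left to you.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_of(version_string: str) -> int:
    """Extract the leading major component, e.g. '2.1.3' gives 2."""
    return int(version_string.split(".")[0])


if __name__ == "__main__":
    # Package names are taken from the migration step above.
    for package in ("numpy", "paddlepaddle", "unstructured-paddleocr", "onnx"):
        found = installed_version(package)
        print(f"{package}: {found or 'not installed'}")
```

Packages that are not installed are reported as such rather than raising, so the script can run in a partial environment.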
✨ New Features
- Added an "image_url" metadata field to extracted image elements when parsing HTML, containing the content of the src attribute for non-data URLs.
- Switched from using BeautifulSoup (bs4) to lxml for parsing hOCR data to improve performance.
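The "non-data URL" rule for `image_url` can be illustrated with a small standalone sketch. This is not the library's implementation: it uses the standard-library `html.parser` rather than unstructured itself, and the class and function names (`ImageSrcCollector`, `collect_image_urls`) are hypothetical. It assumes that an inline `data:` URI yields no URL, which is recorded here as `None`.

```python
from html.parser import HTMLParser
from typing import List, Optional


class ImageSrcCollector(HTMLParser):
    """Collect one entry per <img> tag: its src URL, or None for data: URIs."""

    def __init__(self) -> None:
        super().__init__()
        self.image_urls: List[Optional[str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Inline data: URIs embed the image payload and are not treated as URLs.
        if src and not src.strip().lower().startswith("data:"):
            self.image_urls.append(src)
        else:
            self.image_urls.append(None)


def collect_image_urls(html: str) -> List[Optional[str]]:
    """Return the src of every <img> in document order, None where it is a data: URI."""
    collector = ImageSrcCollector()
    collector.feed(html)
    return collector.image_urls


if __name__ == "__main__":
    sample = (
        '<div><img src="https://example.com/a.png"></div>'
        '<span><img src="data:image/png;base64,AAAA"/></span>'
    )
    print(collect_image_urls(sample))
```

With unstructured itself, the corresponding value would be read from the `image_url` metadata field of each image element produced by the HTML partitioner.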
🐛 Bug Fixes
- Fixed an issue where an image inside a <div> or <span> tag without any associated text was incorrectly categorized as "UncategorizedText" instead of being annotated as an Image.
🔧 Affected Symbols
- html partitioner
- hOCR parser