0.17.2

📅 Mar 20, 2025 📦 unstructured

✨ 2 features 🐛 1 fix 🔧 2 symbols

Summary

This release adds image URL extraction to HTML partitioning and speeds up hOCR parsing by switching from bs4 to lxml. It also raises the required numpy version to >2.

Migration Steps

If you rely on hOCR parsing performance, note that the underlying parser has changed from bs4 to lxml.
Ensure your environment supports numpy version >2, as dependencies like paddlepaddle, unstructured-paddleocr, and onnx have been upgraded to maintain compatibility.
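
A quick way to confirm the numpy requirement before upgrading is to inspect the installed version. The sketch below uses only the standard library. It assumes that ">2" means the 2.x series or later, which should be confirmed against the package's own requirements; the helper names are hypothetical and not part of unstructured.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '2.1.3'."""
    return int(version_string.split(".")[0])


def numpy_is_2x_or_later(version_string: str) -> bool:
    # Assumption: ">2" in the release note is read as "2.x series or later".
    return major_version(version_string) >= 2


def installed_numpy_version() -> Optional[str]:
    """Return the installed numpy version, or None if numpy is absent."""
    try:
        return version("numpy")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_numpy_version()
    if found is None:
        print("numpy is not installed")
    else:
        status = "ok" if numpy_is_2x_or_later(found) else "too old"
        print(f"numpy {found}: {status}")
```

Running the script in the target environment prints the detected numpy version together with the result of the check.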

✨ New Features

Added an "image_url" metadata field to extracted image elements when parsing HTML, containing the content of the src attribute for non-data URLs.
Switched from using BeautifulSoup (bs4) to lxml for parsing hOCR data to improve performance.
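
The "image_url" rule described above can be illustrated with a small standalone sketch. This is not the library's implementation: it uses only Python's standard html.parser, and the names image_url_from_src and ImageUrlCollector are hypothetical. It also assumes that an image with a data URL simply gets no "image_url" value, represented here as None.

```python
from html.parser import HTMLParser
from typing import List, Optional


def image_url_from_src(src: Optional[str]) -> Optional[str]:
    """Return src when it is a regular URL, None for missing or data: URLs."""
    if not src or src.strip().lower().startswith("data:"):
        return None
    return src


class ImageUrlCollector(HTMLParser):
    """Collects one image_url value per <img> tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.image_urls: List[Optional[str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; look up the src attribute.
            self.image_urls.append(image_url_from_src(dict(attrs).get("src")))


sample_html = """
<div><img src="https://example.com/chart.png" alt="chart"></div>
<span><img src="data:image/png;base64,iVBORw0KGgo=" alt="inline"></span>
"""

collector = ImageUrlCollector()
collector.feed(sample_html)
print(collector.image_urls)  # ['https://example.com/chart.png', None]
```

The first image keeps its src as the URL, while the inline data URL is skipped, matching the "non-data URLs" wording in the feature note.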

🐛 Bug Fixes

Fixed an issue where an image inside a <div> or <span> tag without any associated text was incorrectly categorized as "UncategorizedText" instead of being annotated as an Image.

🔧 Affected Symbols

html partitioner
hOCR parser