Unstructured

Data & ML

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

Latest: 0.21.5 | 47 releases | 3 breaking changes | 4 common errors | Updated Feb 24, 2026

Release History

0.21.5 (1 fix, 1 feature)

Feb 24, 2026

This release introduces a new feature for custom language detection fallbacks and resolves a dependency constraint issue with pdfminer.six.

0.21.2

Feb 23, 2026

0.21.1

Feb 22, 2026

This release primarily bumps the version number to 0.21.1.

0.21.0 (Breaking, 1 fix, 1 feature)

Feb 22, 2026

Version 0.21.0 replaces the vulnerable NLTK dependency with spaCy to fix a critical RCE vulnerability (CVE-2025-14009) in the downloader mechanism.

0.20.8 (2 fixes)

Feb 20, 2026

This release primarily focuses on bug fixes, including setting the max decompressed size for elements JSON and updating dependencies.
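Capping decompressed size protects against payloads that are tiny on the wire but huge in memory. The sketch below illustrates the general technique with the standard `zlib` module; the function name `bounded_decompress` and the choice of zlib are assumptions for illustration, not the library's actual implementation.

```python
import zlib


def bounded_decompress(payload: bytes, max_size: int) -> bytes:
    """Decompress zlib data, refusing to produce more than max_size bytes."""
    if max_size < 0:
        raise ValueError("max_size must be non-negative")
    decompressor = zlib.decompressobj()
    # Request at most one byte beyond the limit: an overrun becomes
    # detectable without ever materializing the full output.
    out = decompressor.decompress(payload, max_size + 1)
    if len(out) > max_size:
        raise ValueError(f"decompressed size exceeds {max_size} bytes")
    return out


small = zlib.compress(b'[{"type": "Title", "text": "Hello"}]')
oversized = zlib.compress(b"0" * 10_000_000)

print(bounded_decompress(small, 1024))
try:
    bounded_decompress(oversized, 1024)
except ValueError as err:
    print("rejected:", err)
```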

0.20.6 (3 fixes, 1 feature)

Feb 19, 2026

This release focuses on improving stability and performance, including fixes for PDF rendering artifacts and HTML parsing errors, alongside automating the PyPI publishing process.

0.20.2

Feb 13, 2026

0.20.1 (1 feature)

Feb 12, 2026

This release introduces official support for Python 3.11 and 3.13, alongside infrastructure updates for image publishing.

0.19.3 (4 fixes, 3 features)

Feb 11, 2026

This release introduces a new utility function, improves PDF partitioning by preserving newlines in table elements, and resolves several build and image-related bugs, particularly for ARM64 environments. The project also migrated its dependency management to use uv.

0.18.32 (1 feature)

Feb 10, 2026

This release introduces a threadlock around the pdfium call for improved concurrency safety.
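As a rough illustration of the pattern, a module-level lock can serialize calls into code that is not safe to run concurrently. In this sketch `render_page` is a hypothetical stand-in for the native call, and the in-flight counter exists only to show that the lock keeps calls from overlapping.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_render_lock = threading.Lock()
_in_flight = 0
max_in_flight = 0


def render_page(page_number: int) -> str:
    """Hypothetical stand-in for a call into a non-thread-safe library."""
    global _in_flight, max_in_flight
    with _render_lock:
        # Only one thread at a time can be inside this block.
        _in_flight += 1
        max_in_flight = max(max_in_flight, _in_flight)
        result = f"page-{page_number}"
        _in_flight -= 1
    return result


with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order even though workers run concurrently.
    rendered = list(pool.map(render_page, range(8)))

print(rendered)
print("max concurrent calls:", max_in_flight)
```

The trade-off is throughput: work inside the lock runs serially, so only the unsafe call itself should be guarded.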

0.18.31 (7 fixes, 3 features)

Jan 27, 2026

This release introduces Token-Based Chunking support and enhances PDF processing by patching pdfminer and considering rotated text. Several performance improvements and dependency updates were also included.
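To make the idea of token-based chunking concrete, here is a minimal sketch that splits text into chunks of at most `max_tokens` tokens. It assumes whitespace-separated words as tokens and uses a hypothetical `chunk_by_tokens` helper; the library's own chunker and tokenizer are not shown here and may behave differently.

```python
def chunk_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Split text into chunks holding at most max_tokens whitespace tokens."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    tokens = text.split()
    # Step through the token list in windows of max_tokens.
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


print(chunk_by_tokens("one two three four five", max_tokens=2))
```

A real tokenizer for a specific language model would count subword tokens rather than words, so chunk boundaries would differ.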

0.18.28

Jan 9, 2026

0.18.27 (2 fixes)

Jan 8, 2026

This release focuses on performance optimizations across several internal functions and resolves a bug related to partially extracted elements. A critical dependency upgrade to pdfminer-six addresses a performance regression.

0.18.26 (2 fixes)

Jan 5, 2026

Version 0.18.26 pins deltalake to resolve ARM64 build issues, while 0.18.25 includes a security update by relaxing constraints and bumping pdfminer.six and urllib3.

0.18.24 (1 fix, 1 feature)

Dec 30, 2025

This release includes an optimization for OCR extraction and a critical security update by bumping dependencies.

0.18.22 (1 fix)

Dec 10, 2025

This patch release primarily addresses a security vulnerability by updating the fonttools dependency.

0.18.21 (1 fix)

Nov 24, 2025

This release primarily focuses on security by updating the "unstructured-inference" dependency to version 1.1.2 to address known CVEs.

0.18.20 (2 features)

Nov 15, 2025

This release improves the VoyageAI integration, adds support for voyage-context-3, and enhances metadata flagging for extracted elements.

0.18.18 (4 fixes)

Nov 7, 2025

This set of releases focuses heavily on security by addressing multiple CVEs through dependency bumps and fixing a critical path traversal vulnerability in MSG attachment handling. Additionally, performance was improved in hash ID assignment.
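Path traversal in attachment handling typically happens when an attacker-controlled filename such as `../../etc/passwd` is joined to an output directory. The following is a generic defensive sketch, not the fix that shipped; `safe_attachment_path` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def safe_attachment_path(output_dir: str, attachment_name: str) -> Path:
    """Return a path for the attachment that stays inside output_dir."""
    base = Path(output_dir).resolve()
    # Keep only the final component, dropping any directories supplied
    # in the name (backslashes are normalized to handle Windows-style names).
    name = Path(attachment_name.replace("\\", "/")).name
    if name in ("", ".", ".."):
        raise ValueError(f"unusable attachment name: {attachment_name!r}")
    candidate = (base / name).resolve()
    # Final check: the resolved path must still live under the base directory.
    if base not in candidate.parents:
        raise ValueError(f"attachment escapes output directory: {attachment_name!r}")
    return candidate


out_dir = tempfile.gettempdir()
print(safe_attachment_path(out_dir, "../../etc/passwd"))
```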

0.18.15 (1 fix, 2 features)

Sep 17, 2025

This release focuses on performance improvements for paragraph grouping and HTML element processing, alongside dependency updates to address security vulnerabilities.

0.18.14 (2 fixes, 1 feature)

Aug 26, 2025

This release focuses on performance improvements, notably speeding up the sentence_count function, and addresses several security vulnerabilities by updating dependencies.

0.18.13 (1 fix)

Aug 13, 2025

This patch improves the robustness of email parsing by handling more diverse date formats within email headers.

0.18.12 (1 fix)

Jul 28, 2025

This release improves error handling during encoding detection by replacing UnicodeDecodeError with UnprocessableEntityError to prevent logging large file contents.
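The reason this matters is that a `UnicodeDecodeError` carries the entire input in its `object` attribute, so anything that logs or serializes the exception can end up dumping the file. A minimal sketch of the pattern follows, using a locally defined `UnprocessableInputError` as a stand-in for the library's error type.

```python
class UnprocessableInputError(Exception):
    """Local stand-in for a library-specific error type (hypothetical)."""


def decode_text(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, reporting failures without retaining the payload."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        position = exc.start
    # Raised outside the except block so the new error is not chained to
    # the original exception, whose .object attribute holds the full input.
    raise UnprocessableInputError(
        f"could not decode {len(data)} bytes as {encoding} "
        f"(first failure at byte {position})"
    )


try:
    decode_text(b"\xff" * 1000)
except UnprocessableInputError as err:
    print(err)
```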

0.18.11 (2 features)

Jul 23, 2025

This release introduces support for '|' as a CSV delimiter and adds mapping for <input> tags based on their type attribute. It also switches the dependency used for character set normalization.
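For readers handling pipe-delimited files themselves, the sketch below shows one naive way to pick a delimiter from a set of candidates that includes `|` and then parse with the standard `csv` module. The heuristic and the names `detect_delimiter` and `read_rows` are illustrative assumptions, not the library's detection logic, and quoted fields containing delimiters are not handled.

```python
import csv
import io

CANDIDATES = (",", ";", "\t", "|")


def detect_delimiter(sample: str) -> str:
    """Pick the candidate that appears equally often on every line."""
    lines = [line for line in sample.splitlines() if line.strip()]
    best, best_count = None, 0
    for delimiter in CANDIDATES:
        counts = {line.count(delimiter) for line in lines}
        # A plausible delimiter occurs the same non-zero number of times
        # on each line; prefer the one that occurs most often.
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = delimiter, count
    if best is None:
        raise ValueError("no consistent delimiter found")
    return best


def read_rows(text: str) -> list[list[str]]:
    """Parse delimited text using the detected delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=detect_delimiter(text)))


print(read_rows("name|qty\nwidget|3\n"))
```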

0.18.10 (1 feature)

Jul 18, 2025

This minor release introduces a new environment variable, OCR_AGENT_CACHE_SIZE, for better memory management in OCR agents.
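How the library consumes `OCR_AGENT_CACHE_SIZE` internally is not described here. As a sketch of the general idea, an environment variable can bound an `lru_cache` so that only a limited number of expensive objects stay in memory. The default of 1, the fallback behavior, and the helper names are all assumptions made for illustration.

```python
import os
from functools import lru_cache


def cache_size_from_env(name: str = "OCR_AGENT_CACHE_SIZE", default: int = 1) -> int:
    """Read a non-negative integer from the environment, else use default."""
    try:
        value = int(os.environ.get(name, ""))
    except ValueError:
        return default
    return value if value >= 0 else default


def make_agent_loader(maxsize: int):
    """Build a loader that keeps at most maxsize agents cached."""

    @lru_cache(maxsize=maxsize)
    def load_agent(module_name: str) -> object:
        # Placeholder for constructing an expensive OCR agent.
        return object()

    return load_agent


os.environ["OCR_AGENT_CACHE_SIZE"] = "2"
load_agent = make_agent_loader(cache_size_from_env())
load_agent("agent_a")
load_agent("agent_b")
load_agent("agent_c")  # evicts the least recently used entry
print(load_agent.cache_info())
```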

0.18.9 (2 fixes, 1 feature)

Jul 16, 2025

Version 0.18.9 introduces a new feature to convert elements to markdown and includes fixes for language detection with empty text and handling password-protected XLSX files.

0.18.7 (1 feature)

Jul 15, 2025

This release introduces language detection capabilities for PDF documents at both the document and element levels.

0.18.6 (2 fixes)

Jul 15, 2025

This release focuses on stability, improving EPUB partitioning error handling and correcting serialization types for TableChunks.

0.18.4 (1 fix)

Jul 8, 2025

This minor release primarily addresses an issue by increasing the field limit when parsing CSV files.

0.18.3 (1 fix)

Jul 5, 2025

This minor release bumps the pillow dependency to address a security vulnerability (CVE).

0.18.2 (5 fixes)

Jul 1, 2025

This release focuses on stability and correctness, addressing several bugs related to HTML parsing, XML escaping, Markdown encoding, and table processing.

0.18.1 (1 fix, 1 feature)

Jun 24, 2025

This release introduces the DocumentData element type for better handling of large document data and fixes an issue related to the encoding property in _CsvPartitioningContext.

0.17.11-dev1 (Breaking, 11 fixes, 5 features)

Jun 13, 2025

This release drops support for Python 3.9 due to dependency conflicts and includes numerous bug fixes related to PDF parsing, stability, and security vulnerabilities. It also deprecates the `stage_for_label_studio` utility.

0.17.2 (1 fix, 2 features)

Mar 20, 2025

This release introduces the extraction of image URLs from HTML partitions and significantly speeds up hOCR parsing by switching to lxml. It also bumps the minimum required numpy version to greater than 2.

0.17.0 (1 fix, 2 features)

Mar 12, 2025

Version 0.17.0 introduces image inclusion during HTML partitioning and allows passing OCR/table agents, while removing the deprecated PageLayout.elements reference.

0.16.25 (1 fix)

Mar 7, 2025

This minor release primarily addresses a bug in filetype detection when handling JSON byte streams.

0.16.24 (1 fix, 2 features)

Mar 7, 2025

This release introduces dynamic file type registration for partitioners and enhances image block extraction for CamelCase element types. It also adds support for converting JSON elements to HTML.

0.16.23 (1 fix)

Feb 20, 2025

This minor release primarily addresses a bug in file type detection when handling SpooledTemporaryFile objects.

0.16.22 (1 fix)

Feb 20, 2025

This minor release focuses on security by addressing open CVEs and updating underlying dependencies.

0.16.21 (1 fix, 3 features)

Feb 17, 2025

This release introduces password support for PDF loading and new configuration options for PDF Miner, alongside performance improvements in layout merging and a fix for NDJSON file detection.

0.16.20 (1 fix)

Feb 6, 2025

This release addresses a critical security vulnerability related to file inclusion in rst and org files by implementing sandboxing for file partitioning.

0.16.19 (2 fixes)

Feb 5, 2025

This release primarily focuses on fixing critical bugs related to table extraction logic and HTML partitioning within table cells. It also updates internal tooling configurations like `make tidy`.

0.16.17 (1 fix, 1 feature)

Jan 29, 2025

This minor release refactors the VoyageAI integration and fixes a bug related to text ordering in layout element construction.

0.16.16 (2 fixes, 1 feature)

Jan 27, 2025

This release introduces vectorized data structures for layout processing to improve performance and fixes an issue with NLTK auto-downloading and a patch in pdfminer that caused token splitting errors.

0.16.15

Jan 23, 2025

This release updates the versions of `unstructured-inference` and `pdfminer-six`, resulting in the removal of `layoutparser` related dependencies from `unstructured-inference`.

0.16.14 (1 fix)

Jan 20, 2025

This release primarily addresses a bug related to redundant passing of the 'infer_table_structure' argument during email partitioning with image attachments.

0.16.13 (Breaking, 1 fix, 1 feature)

Jan 13, 2025

This release introduces character-level filtering for Tesseract output and resolves an issue with NLTK asset usage in Docker images, while removing automatic NLTK package downloading.

Common Errors

ModuleNotFoundError2 reports

A "ModuleNotFoundError" in unstructured usually means a required dependency is missing. To fix this, identify the missing module from the error message (e.g., 'unstructured_pytesseract') and install it using pip: `pip install <missing_module>`. For example: `pip install unstructured_pytesseract`.
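Before running a pipeline, it can help to check which optional modules are importable so the failure surfaces early with a clear message. This sketch uses the standard `importlib.util.find_spec`; the module list is only an example, and `missing_modules` is a hypothetical helper that expects top-level module names.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# "json" ships with Python; the second name comes from the error text above.
missing = missing_modules(["json", "unstructured_pytesseract"])
for name in missing:
    print(f"Missing module {name!r}; try: pip install {name}")
```

Note that the importable module name and the pip package name are not always identical, so the printed hint is a starting point rather than a guaranteed fix.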

UnsupportedFileFormatError2 reports

UnsupportedFileFormatError in unstructured usually occurs when the file's detected type doesn't match its actual content or when necessary dependencies for that file type are missing. Ensure the file extension is correct and corresponds to its content. Install any required packages for handling that specific file type, such as `pip install python-docx` for .docx files or `pip install pypdf` for PDFs, if they are not already installed.
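One quick way to check whether an extension agrees with the content is to compare the file's leading bytes with well-known signatures. The sketch below is an independent sanity check and not the library's file type detection; the signature table is deliberately small, and `extension_matches_content` is a hypothetical helper.

```python
import os
from typing import Optional

# Leading bytes for a few common formats (.docx files are ZIP archives).
SIGNATURES = {
    ".pdf": b"%PDF-",
    ".docx": b"PK\x03\x04",
    ".png": b"\x89PNG\r\n\x1a\n",
}


def extension_matches_content(filename: str, head: bytes) -> Optional[bool]:
    """Compare a file's extension with its leading bytes.

    Returns True or False for known extensions and None when the
    extension is not in the signature table.
    """
    extension = os.path.splitext(filename)[1].lower()
    signature = SIGNATURES.get(extension)
    if signature is None:
        return None
    return head.startswith(signature)


print(extension_matches_content("report.pdf", b"%PDF-1.7\n"))
print(extension_matches_content("report.docx", b"%PDF-1.7\n"))
```

In practice `head` would be the first few bytes read from the file in binary mode.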

FileNotFoundError2 reports

FileNotFoundError in unstructured often arises when the library can't locate necessary system dependencies (like Tesseract for OCR or LibreOffice for document conversion) required for processing specific file types. Ensure these dependencies are installed and accessible in your system's PATH environment variable. For example, install Tesseract using your system's package manager and verify LibreOffice is correctly installed if processing .docx files originating from cloud services like Office 365.
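A small preflight check can confirm that the required executables are on `PATH` before any documents are processed. This sketch uses the standard `shutil.which`; it assumes the usual binary names `tesseract` and `soffice` (the LibreOffice launcher), which may differ on some installations.

```python
import shutil


def missing_executables(names: list[str]) -> list[str]:
    """Return the executables that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_executables(["tesseract", "soffice"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All system dependencies found.")
```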

DocxLoadError1 report

DocxLoadError usually arises from corrupted or invalid .docx files that the `python-docx` library can't parse. To fix this, try opening the document in Microsoft Word or another compatible word processor and saving it again. Alternatively, use a try-except block to catch the error and skip processing problematic files, or pre-process documents to validate their .docx file structure.
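The pre-validation step can be sketched with the standard `zipfile` module, since a .docx file is a ZIP archive that conventionally contains `[Content_Types].xml` and `word/document.xml`. This is a lightweight structural check under that assumption, not a full validation, and `looks_like_docx` is a hypothetical helper.

```python
import io
import zipfile

REQUIRED_MEMBERS = ("[Content_Types].xml", "word/document.xml")


def looks_like_docx(data: bytes) -> bool:
    """Return True if the bytes form a ZIP archive with the usual .docx parts."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return False
    return all(member in names for member in REQUIRED_MEMBERS)


def make_zip(member_names) -> bytes:
    """Build an in-memory ZIP with empty members, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in member_names:
            archive.writestr(name, "")
    return buffer.getvalue()


print(looks_like_docx(make_zip(REQUIRED_MEMBERS)))
print(looks_like_docx(b"not a zip archive"))
```

Files that fail this check can be skipped or routed to manual review before partitioning is attempted.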

Related Data & ML Packages

TensorFlow

An Open Source Machine Learning Framework for Everyone

Transformers

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

PyTorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

scikit-learn

scikit-learn: machine learning in Python

Pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
