Unstructured
Data & ML
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking and embedding.
Release History
0.18.28, 0.18.27 (2 fixes): This release focuses on performance optimizations across several internal functions and resolves a bug related to partially extracted elements. A critical dependency upgrade to pdfminer.six addresses a performance regression.
0.18.26 (2 fixes): Version 0.18.26 pins deltalake to resolve ARM64 build issues, while 0.18.25 includes a security update by relaxing constraints and bumping pdfminer.six and urllib3.
0.18.24 (1 fix, 1 feature): This release includes an optimization for OCR extraction and a critical security update by bumping dependencies.
0.18.22 (1 fix): This patch release primarily addresses a security vulnerability by updating the fonttools dependency.
0.18.21 (1 fix): This release primarily focuses on security by updating the "unstructured-inference" dependency to version 1.1.2 to address known CVEs.
0.18.20 (2 features): This release improves the VoyageAI integration, adds support for voyage-context-3, and enhances metadata flagging for extracted elements.
0.18.18 (4 fixes): This set of releases focuses heavily on security by addressing multiple CVEs through dependency bumps and fixing a critical path traversal vulnerability in MSG attachment handling. Additionally, performance was improved in hash ID assignment.
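To illustrate the kind of path traversal fix described for MSG attachments, here is a minimal sketch of sanitizing an attachment filename before writing it to disk. It is not the library's actual patch: `safe_attachment_path` and its parameters are hypothetical names, and the approach (keep only the final path component, then verify the resolved target stays inside the output directory) is one common mitigation.

```python
from pathlib import Path


def safe_attachment_path(output_dir: str, attachment_name: str) -> Path:
    """Return a path inside output_dir for an untrusted attachment name.

    Hypothetical helper for illustration only.
    """
    base = Path(output_dir).resolve()
    # Treat Windows separators like POSIX ones, then keep only the last component,
    # so "../../etc/passwd" becomes "passwd".
    name = Path(attachment_name.replace("\\", "/")).name
    if name in ("", ".", ".."):
        raise ValueError(f"unusable attachment name: {attachment_name!r}")
    target = (base / name).resolve()
    # Defense in depth: reject anything that still resolves outside base
    # (for example via an existing symlink).
    if not target.is_relative_to(base):
        raise ValueError(f"attachment escapes output directory: {attachment_name!r}")
    return target


path = safe_attachment_path("/tmp/attachments", "../../etc/passwd")
print(path.name)
```

Stripping to the final component handles the common case; the containment check afterwards covers names that resolve elsewhere for other reasons.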
0.18.15 (1 fix, 2 features): This release focuses on performance improvements for paragraph grouping and HTML element processing, alongside dependency updates to address security vulnerabilities.
0.18.14 (2 fixes, 1 feature): This release focuses on performance improvements, notably speeding up the sentence_count function, and addresses several security vulnerabilities by updating dependencies.
0.18.13 (1 fix): This patch improves the robustness of email parsing by handling more diverse date formats within email headers.
0.18.12 (1 fix): This release improves error handling during encoding detection by replacing UnicodeDecodeError with UnprocessableEntityError to prevent logging large file contents.
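The motivation for swapping the exception type can be sketched as follows. A `UnicodeDecodeError` carries the full input bytes in its `object` attribute, so logging the exception object can dump the whole file. The `UnprocessableEntityError` class below is a local stand-in defined for this example, and `decode_text` is a hypothetical helper; neither is taken from the library's source.

```python
class UnprocessableEntityError(Exception):
    """Local stand-in for the library's error type (assumption for illustration)."""


def decode_text(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, reporting failures without echoing the payload."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Report only size and offset; "from None" keeps the original exception
        # (and the bytes it holds) out of the printed traceback.
        raise UnprocessableEntityError(
            f"Could not decode {len(data)} bytes as {encoding} "
            f"(failed at byte offset {exc.start})."
        ) from None


payload = b"A" * 1000 + b"\xff"
try:
    decode_text(payload)
except UnprocessableEntityError as err:
    print(err)
```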
0.18.11 (2 features): This release introduces support for '|' as a CSV delimiter and adds mapping for <input> tags based on their type attribute. It also switches the dependency used for character set normalization.
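As a rough illustration of pipe-delimited CSV handling, the sketch below uses the standard library's `csv.Sniffer` with `|` included among the candidate delimiters. This is an assumption about how such detection could work, not a description of the library's internals; `read_delimited` and the sample data are made up for the example.

```python
import csv
import io


def read_delimited(text: str, candidates: str = ",;|\t"):
    """Guess the delimiter from a restricted candidate set, then parse the rows."""
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect.delimiter, rows


sample = "name|age|city\nAda|36|London\nLinus|54|Portland\n"
delimiter, rows = read_delimited(sample)
print(delimiter, rows[0])
```

Restricting the candidate set keeps the sniffer from picking an arbitrary character that happens to repeat consistently across lines.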
0.18.10 (1 feature): This minor release introduces a new environment variable, OCR_AGENT_CACHE_SIZE, for better memory management in OCR agents.
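One way an environment variable like OCR_AGENT_CACHE_SIZE could bound memory use is by capping an LRU cache of loaded agents. The sketch below assumes the value is an integer count of cached agents; the default of 1, the `build_agent_cache` helper, and the agent names are hypothetical and not taken from the library.

```python
import os
from functools import lru_cache


def build_agent_cache(default_size: int = 1):
    """Create an agent loader whose cache size comes from the environment."""
    size = int(os.environ.get("OCR_AGENT_CACHE_SIZE", default_size))

    @lru_cache(maxsize=size)
    def get_agent(agent_name: str) -> object:
        # Stand-in for an expensive OCR agent construction.
        return object()

    return get_agent


os.environ["OCR_AGENT_CACHE_SIZE"] = "2"
get_agent = build_agent_cache()
for name in ("agent_a", "agent_b", "agent_c"):
    get_agent(name)
# Only the two most recently used agents remain cached.
print(get_agent.cache_info().currsize)
```

A smaller value trades repeated agent construction for a lower resident memory footprint.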
0.18.9 (2 fixes, 1 feature): Version 0.18.9 introduces a new feature to convert elements to markdown and includes fixes for language detection with empty text and handling password-protected XLSX files.
0.18.7 (1 feature): This release introduces language detection capabilities for PDF documents at both the document and element levels.
0.18.6 (2 fixes): This release focuses on stability, improving EPUB partitioning error handling and correcting serialization types for TableChunks.
0.18.4 (1 fix): This minor release increases the field limit used when parsing CSV files.
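For context on what a CSV field limit is, Python's standard `csv` module rejects any single field longer than its configured limit (131072 characters by default in CPython). The sketch below shows the failure and the effect of raising the limit with `csv.field_size_limit`. Whether the library uses exactly this mechanism, and the 10 MiB value chosen here, are assumptions; `read_rows` is a hypothetical helper.

```python
import csv
import io


def read_rows(text: str, field_limit: int) -> list[list[str]]:
    """Parse CSV text under a given field size limit, restoring the old limit after."""
    previous = csv.field_size_limit(field_limit)  # returns the previous limit
    try:
        return list(csv.reader(io.StringIO(text)))
    finally:
        csv.field_size_limit(previous)


big_cell = "x" * 200_000
data = f"id,text\n1,{big_cell}\n"

try:
    read_rows(data, field_limit=131_072)  # CPython's default limit
except csv.Error as exc:
    print("default limit:", exc)

rows = read_rows(data, field_limit=10 * 1024 * 1024)
print(len(rows[1][1]))
```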
0.18.3 (1 fix): This minor release bumps the pillow dependency to address a security vulnerability (CVE).
0.18.2 (5 fixes): This release focuses on stability and correctness, addressing several bugs related to HTML parsing, XML escaping, Markdown encoding, and table processing.
0.18.1 (1 fix, 1 feature): This release introduces the DocumentData element type for better handling of large document data and fixes an issue related to the encoding property in _CsvPartitioningContext.
0.17.11-dev1 (Breaking, 11 fixes, 5 features): This release drops support for Python 3.9 due to dependency conflicts and includes numerous bug fixes related to PDF parsing, stability, and security vulnerabilities. It also deprecates the `stage_for_label_studio` utility.
0.17.2 (1 fix, 2 features): This release introduces the extraction of image URLs from HTML partitions and significantly speeds up hOCR parsing by switching to lxml. It also bumps the minimum required numpy version to greater than 2.
0.17.0 (1 fix, 2 features): Version 0.17.0 introduces image inclusion during HTML partitioning and allows passing OCR/table agents, while removing the deprecated PageLayout.elements reference.
0.16.25 (1 fix): This minor release primarily addresses a bug in filetype detection when handling JSON byte streams.
0.16.24 (1 fix, 2 features): This release introduces dynamic file type registration for partitioners and enhances image block extraction for CamelCase element types. It also adds support for converting JSON elements to HTML.
0.16.23 (1 fix): This minor release primarily addresses a bug in file type detection when handling SpooledTemporaryFile objects.
0.16.22 (1 fix): This minor release focuses on security by addressing open CVEs and updating underlying dependencies.
0.16.21 (1 fix, 3 features): This release introduces password support for PDF loading and new configuration options for PDF Miner, alongside performance improvements in layout merging and a fix for NDJSON file detection.
0.16.20 (1 fix): This release addresses a critical security vulnerability related to file inclusion in rst and org files by implementing sandboxing for file partitioning.
0.16.19 (2 fixes): This release primarily focuses on fixing critical bugs related to table extraction logic and HTML partitioning within table cells. It also updates internal tooling configurations like `make tidy`.
0.16.17 (1 fix, 1 feature): This minor release refactors the VoyageAI integration and fixes a bug related to text ordering in layout element construction.
0.16.16 (2 fixes, 1 feature): This release introduces vectorized data structures for layout processing to improve performance. It also fixes an issue with NLTK auto-downloading and a patch in pdfminer that caused token splitting errors.
0.16.15: This release updates the versions of `unstructured-inference` and `pdfminer-six`, resulting in the removal of `layoutparser` related dependencies from `unstructured-inference`.
0.16.14 (1 fix): This release primarily addresses a bug related to redundant passing of the 'infer_table_structure' argument during email partitioning with image attachments.
0.16.13 (Breaking, 1 fix, 1 feature): This release introduces character-level filtering for Tesseract output and resolves an issue with NLTK asset usage in Docker images, while removing automatic NLTK package downloading.