0.18.31

📦 unstructured
3 features · 🐛 7 fixes · 🔧 5 symbols

Summary

This release adds token-based chunking support and improves PDF processing by patching pdfminer to detect invisible text and by accounting for rotated text. It also includes several performance improvements and dependency updates.

Migration Steps

  1. The default value of the languages parameter has changed from ["auto"] to None. If your code relied on the old default without setting the parameter, pass languages=["auto"] explicitly to keep that behavior.
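One way to keep the old behavior is a thin wrapper that fills in languages=["auto"] only when the caller has not set it. This is a minimal sketch: the helper names with_legacy_languages and partition_with_legacy_default are hypothetical, and it assumes you call the library through unstructured.partition.auto.partition.

```python
from typing import Any

# The default used before this release, per the migration note above.
OLD_DEFAULT_LANGUAGES = ["auto"]


def with_legacy_languages(kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of kwargs with languages=["auto"] unless the caller set it."""
    merged = dict(kwargs)
    if merged.get("languages") is None:
        merged["languages"] = list(OLD_DEFAULT_LANGUAGES)
    return merged


def partition_with_legacy_default(filename: str, **kwargs: Any):
    """Hypothetical wrapper: call partition() with the pre-release default."""
    # Imported lazily so the helper above can be used without unstructured installed.
    from unstructured.partition.auto import partition

    return partition(filename=filename, **with_legacy_languages(kwargs))
```

Callers that already pass languages (for example languages=["eng"]) are left untouched, so the wrapper only affects code that depended on the previous default.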

✨ New Features

  • Patch pdfminer and use rendermode to detect invisible text.
  • Treat rotated text as low fidelity during processing.
  • Token-Based Chunking Support.
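To illustrate the general idea behind token-based chunking, here is a minimal sketch that limits chunk size by token count rather than character count. It is not unstructured's implementation or API: the function chunk_by_tokens and its parameters are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_by_tokens(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated tokens.

    Consecutive chunks share `overlap` tokens. Whitespace splitting is an
    assumption made for illustration; a real tokenizer would count differently.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    tokens = text.split()
    step = max_tokens - overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once this window has reached the end of the token list.
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

For example, chunk_by_tokens("a b c d e", max_tokens=2) yields three chunks: "a b", "c d", and "e".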

🐛 Bug Fixes

  • Add EN DASH to UNICODE_BULLETS for clean_bullets.
  • Fix version number.
  • Address jaraco CVE.
  • Reduce default dpi to 350.
  • Remove sandbox=True from pypandoc to fix ODT conversion.
  • Filter coordinates kwargs to prevent TypeError in hi_res PDF processing.
  • Preserve line breaks in code blocks during chunking.
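The EN DASH fix in the list above can be pictured with a simplified stand-in for bullet cleaning. This sketch is an assumption about the general approach, not the library's code: BULLETS is a small illustrative subset rather than the real UNICODE_BULLETS list, and strip_leading_bullet is a hypothetical name.

```python
import re

# Illustrative subset only; U+2013 (EN DASH) is the character this release adds.
BULLETS = ["\u2022", "\u2023", "\u25CF", "\u2013"]

# Match one bullet at the start of the string, followed by whitespace.
BULLET_RE = re.compile(
    r"^\s*(?:" + "|".join(re.escape(b) for b in BULLETS) + r")\s+"
)


def strip_leading_bullet(text: str) -> str:
    """Remove a single leading bullet character and the whitespace after it."""
    return BULLET_RE.sub("", text, count=1).strip()
```

Because the pattern is anchored to the start of the string, an en dash inside the text (as in a year range) is left alone.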

Affected Symbols