0.18.31
📦 unstructured
✨ 3 features · 🐛 7 fixes · 🔧 5 symbols
Summary
This release adds Token-Based Chunking support and improves PDF processing by patching pdfminer to detect invisible text and by taking rotated text into account. It also includes several performance improvements and dependency updates.
Migration Steps
- The default value of the languages parameter has changed from ["auto"] to None. If your code relied on the old default, pass languages=["auto"] explicitly to keep that behavior.
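
One way to keep the old behavior is to set the parameter at the call site. The sketch below is illustrative only: partition_with_auto_languages is a hypothetical helper, not part of unstructured, and it assumes the partition function you pass in accepts a languages keyword argument.

```python
from typing import Any, Callable


def partition_with_auto_languages(partition_fn: Callable[..., Any], **kwargs: Any) -> Any:
    """Call partition_fn with languages=["auto"] unless languages was already passed.

    Hypothetical helper for illustration; not part of unstructured.
    """
    # setdefault only fills the key when it is absent, so an explicit
    # languages=... from the caller (including None) is left untouched.
    kwargs.setdefault("languages", ["auto"])
    return partition_fn(**kwargs)
```

With the library installed, this could be called as partition_with_auto_languages(partition, filename="report.pdf"), where partition is imported from unstructured.partition.auto and the file name is a placeholder.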
✨ New Features
- Patch pdfminer and use rendermode to detect invisible text.
- Treat rotated text as low fidelity and take it into account during processing.
- Token-Based Chunking Support.
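
To illustrate the general idea behind token-based chunking, here is a minimal sketch that caps each chunk by a token count instead of a character count. It is not the unstructured implementation or its API: the function name chunk_by_tokens, its parameters, and the whitespace tokenizer are all assumptions. A real setup would typically count tokens with a model-specific tokenizer.

```python
from typing import List


def chunk_by_tokens(text: str, max_tokens: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most max_tokens tokens.

    Hypothetical sketch; consecutive chunks share `overlap` tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    # Assumption: whitespace-separated words stand in for model tokens.
    tokens = text.split()
    step = max_tokens - overlap

    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once this window has reached the end of the token list,
        # so no trailing chunk made up only of overlap is emitted.
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

Note that joining tokens with single spaces discards the original whitespace, so this sketch would not preserve line breaks inside code blocks.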
🐛 Bug Fixes
- Add EN DASH to UNICODE_BULLETS for clean_bullets.
- Fix version number.
- Address jaraco CVE.
- Reduce default dpi to 350.
- Remove sandbox=True from pypandoc to fix ODT conversion.
- Filter coordinates kwargs to prevent TypeError in hi_res PDF processing.
- Preserve line breaks in code blocks during chunking.
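
The coordinates fix in the list above describes filtering keyword arguments before a call so that unexpected ones do not raise TypeError. A generic sketch of that pattern, using inspect.signature from the standard library, follows. The helper filter_kwargs and the example function draw_box are hypothetical; this is not the code from the release.

```python
import inspect
from typing import Any, Callable, Dict, Tuple


def filter_kwargs(func: Callable[..., Any], kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the entries of kwargs that func can accept by keyword.

    Hypothetical helper for illustration.
    """
    params = inspect.signature(func).parameters
    # If func declares **kwargs itself, every keyword is acceptable.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    allowed = {
        name
        for name, p in params.items()
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    return {key: value for key, value in kwargs.items() if key in allowed}


def draw_box(x: float, y: float, width: float = 1.0) -> Tuple[float, float, float]:
    """Hypothetical callee that does not accept arbitrary keywords."""
    return (x, y, width)


# Passing color directly to draw_box would raise TypeError; filtering drops it first.
safe = filter_kwargs(draw_box, {"x": 1.0, "y": 2.0, "color": "red"})
box = draw_box(**safe)
```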