4.3.0
📦 datasets
✨ 6 features · 🐛 4 fixes · 🔧 2 symbols
Summary
This release improves large-scale distributed dataset streaming, including better cache handling and retries when opening files, alongside several bug fixes and feature enhancements.
Migration Steps
- Update `huggingface_hub` to version `>=1.1.0` to get the full benefit of the streaming improvements.
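
One way to verify the migration step is to check the installed `huggingface_hub` version before starting a streaming job. The following is a minimal sketch using only the Python standard library; the helper names (`parse_release`, `meets_minimum`, `check_hub_version`) are hypothetical and not part of `datasets` or `huggingface_hub`. It compares only the numeric release segment, so a pre-release such as `1.1.0rc1` is treated like `1.1.0`.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Return the leading numeric release segment, e.g. '1.1.0rc1' -> (1, 1, 0)."""
    match = re.match(r"\d+(\.\d+)*", v)
    if match is None:
        raise ValueError(f"unrecognized version string: {v!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = "1.1.0") -> bool:
    """Compare release tuples, padding with zeros so '1.1' equals '1.1.0'."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_hub_version(minimum: str = "1.1.0") -> str:
    """Report whether the installed huggingface_hub satisfies the minimum."""
    try:
        installed = version("huggingface_hub")
    except PackageNotFoundError:
        return "huggingface_hub is not installed"
    if meets_minimum(installed, minimum):
        return f"huggingface_hub {installed} satisfies >={minimum}"
    return f"huggingface_hub {installed} is older than {minimum}; please upgrade"


if __name__ == "__main__":
    print(check_hub_version())
```

If the check reports an older version, upgrading with `pip install -U "huggingface_hub>=1.1.0"` should satisfy the requirement.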
✨ New Features
- Enabled large-scale distributed dataset streaming by keeping the `HfFileSystem` (hffs) cache in workers.
- Added retries when opening hf files during streaming.
- Added support for pyarrow's binary view type in features.
- Enabled streaming of HDF5 files.
- Added custom fingerprint support to `from_generator`.
- Made `batch_fn` picklable.
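
As an illustration of the custom fingerprint feature, the sketch below builds a small dataset with `Dataset.from_generator` and passes an explicit fingerprint. It assumes the new keyword argument is named `fingerprint` and takes a string; the release notes do not spell out the signature, so check the `from_generator` docstring in your installed version. The generator `count_up`, the helper `build_dataset`, and the fingerprint value `"count-up-v1"` are hypothetical.

```python
def count_up(n):
    """Toy generator: each yielded dict becomes one example in the dataset."""
    for i in range(n):
        yield {"id": i, "square": i * i}


def build_dataset(n: int = 5, fingerprint: str = "count-up-v1"):
    """Build a dataset from `count_up` with a caller-supplied fingerprint."""
    # Imported lazily so the generator can be used without `datasets` installed.
    from datasets import Dataset

    # Assumption: `fingerprint` is the keyword added in 4.3.0. Supplying it
    # should avoid hashing the generator and its arguments, which can be slow
    # when they reference large objects.
    return Dataset.from_generator(
        count_up,
        gen_kwargs={"n": n},
        fingerprint=fingerprint,
    )


if __name__ == "__main__":
    ds = build_dataset()
    print(ds)
```

With a custom fingerprint, the caller is presumably responsible for changing the value whenever the generator or its arguments change, since the library no longer derives it from them.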
🐛 Bug Fixes
- Fixed conda dependencies.
- Fixed casting of image columns with polars.
- Fixed the description of the `batch_size` default in the `to_polars` docstrings.
- Added documentation for PDF datasets and OCR.
🔧 Affected Symbols
- `from_generator`
- `to_polars`