4.3.0

📦 datasets
6 features · 🐛 4 fixes · 🔧 2 symbols

Summary

This release improves large-scale distributed dataset streaming, with better cache handling and retries when opening files, alongside several bug fixes and feature enhancements.

Migration Steps

  1. Update `huggingface_hub` to version `>=1.1.0` to take full advantage of the streaming improvements.
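
One way to confirm the requirement above is met is to read the installed version at runtime. The following is a minimal sketch using only the standard library; the helper names (`parse_release`, `meets_minimum`, `check_huggingface_hub`) are hypothetical and not part of `datasets` or `huggingface_hub`, and pre-release suffixes such as `rc1` are deliberately ignored for simplicity.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Parse the leading numeric release segment, e.g. '1.1.0rc1' -> (1, 1, 0).

    Hypothetical helper: pre-release and dev suffixes are dropped, and the
    result is padded with zeros to at least three components.
    """
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


def check_huggingface_hub(minimum: str = "1.1.0") -> str:
    """Report whether the installed huggingface_hub satisfies the minimum."""
    try:
        installed = version("huggingface_hub")
    except PackageNotFoundError:
        return "huggingface_hub is not installed"
    if meets_minimum(installed, minimum):
        return f"huggingface_hub {installed} satisfies >={minimum}"
    return f"huggingface_hub {installed} is older than {minimum}; upgrade it"


print(check_huggingface_hub())
```

If the check reports an older version, upgrading with `pip install -U "huggingface_hub>=1.1.0"` is the usual fix.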

✨ New Features

  • Enabled large-scale distributed dataset streaming by keeping the `HfFileSystem` (hffs) cache in workers.
  • Added support for retrying the opening of Hugging Face Hub files during streaming.
  • Added PyArrow's binary view type to features.
  • Enabled streaming of HDF5 files.
  • Added custom fingerprint support to `from_generator`.
  • Made `batch_fn` picklable.
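
To illustrate the idea behind retrying file opens during streaming, here is a generic retry-with-backoff sketch. It is not the implementation used by `datasets`; the function `open_with_retries`, its parameters, and the `flaky_open` stand-in are all hypothetical, and the retry count and delays are arbitrary assumptions.

```python
import time


def open_with_retries(opener, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call opener() and retry on OSError with exponential backoff.

    Hypothetical helper: makes at most max_retries + 1 attempts, waiting
    base_delay * 2**attempt seconds between them, then re-raises.
    """
    attempt = 0
    while True:
        try:
            return opener()
        except OSError:
            if attempt >= max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
            attempt += 1


# Stand-in for a remote file open that fails twice before succeeding.
calls = {"n": 0}


def flaky_open():
    calls["n"] += 1
    if calls["n"] < 3:
        # ConnectionError is a subclass of OSError, so it is retried.
        raise ConnectionError("transient network error")
    return "file handle"


# Pass a no-op sleep so the demonstration finishes instantly.
result = open_with_retries(flaky_open, sleep=lambda seconds: None)
print(result, "after", calls["n"], "attempts")
```

Catching only `OSError` keeps programming errors such as `TypeError` from being silently retried.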

🐛 Bug Fixes

  • Fixed conda dependencies.
  • Fixed casting of image columns with Polars.
  • Fixed the `batch_size` default description in the `to_polars` docstrings.
  • Added documentation for PDF datasets and OCR.

🔧 Affected Symbols

`from_generator`, `to_polars`