Changelog

4.8.0

📦 datasets

Summary

This release introduces native support for reading and writing datasets in Hugging Face Storage Buckets, brings significant improvements and fixes to streaming `IterableDataset` objects, and resolves a segfault on macOS during multiprocessed `push_to_hub`.

Migration Steps

  1. When calling `push_to_hub` on a streaming dataset, set the new `max_shard_size` parameter if you need to control the size of uploaded shards.
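A minimal sketch of that step, assuming a placeholder repo id and shard cap (neither appears in the release notes):

```python
# Hedged sketch: push a streaming dataset in size-bounded shards.
# The repo id "username/my-dataset" and the "500MB" cap are placeholders.
# The calls need network access and authentication, so they stay commented:
#
# from datasets import load_dataset
#
# ds = load_dataset("json", data_files="data/*.jsonl", streaming=True)
# ds.push_to_hub("username/my-dataset", max_shard_size="500MB")

shard_cap = "500MB"  # each uploaded shard is kept under roughly this size
```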

✨ New Features

  • Added ability to read and write data directly from/to Hugging Face Storage Buckets using `buckets/username/data-bucket` or `hf://buckets/username/data-bucket/*.jsonl` paths.
  • Added `max_shard_size` parameter to `IterableDataset.push_to_hub`.
  • Improved Arrow-native operations for `IterableDataset`.
  • Enhanced support for glob patterns within archives, such as `zip://*.jsonl::hf://datasets/username/dataset-name/data.zip`.
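A rough sketch of the new path styles described above (bucket and repo names are placeholders, and the exact bucket API may differ):

```python
# Placeholder paths illustrating the new bucket scheme and archive globbing.
bucket_files = "hf://buckets/username/data-bucket/*.jsonl"
archive_files = "zip://*.jsonl::hf://datasets/username/dataset-name/data.zip"

# Loading either pattern requires network access, so the calls are illustrative:
#
# from datasets import load_dataset
#
# ds = load_dataset("json", data_files=bucket_files, streaming=True)
# ds = load_dataset("json", data_files=archive_files, streaming=True)
```

The `zip://…::hf://…` form is fsspec-style URL chaining: the part after `::` names the remote archive, and the glob before it selects members inside it.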

🐛 Bug Fixes

  • Fixed multiprocessed `push_to_hub` on macOS by switching from fork to spawn, resolving segfaults.
  • Fixed `reshard_data_sources` functionality.
  • Improved error message when an invalid `data_files` pattern format is provided.
  • Fixed null filling behavior for missing columns in JSONL files.
  • Fixed issues related to `to_pandas`, `videofolder`, and `load_dataset_builder` kwargs when using streaming iterables.

Affected Symbols