4.8.0
📦 datasets
Summary
This release introduces native support for reading and writing datasets to Hugging Face Storage Buckets and brings significant improvements and fixes to dataset streaming iterables. It also resolves a macOS segfault issue during multiprocessing pushes.
Migration Steps
- When calling `push_to_hub` on a streaming dataset (`IterableDataset`), set `max_shard_size` if you need to control the size of the uploaded shards.
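A minimal sketch of the migration step above. The repo id and shard size are illustrative, not taken from the release notes, and the wrapper function is ours:

```python
from typing import Any


def push_streaming_dataset(ds: Any, repo_id: str) -> None:
    # With datasets >= 4.8.0, IterableDataset.push_to_hub accepts
    # max_shard_size; capping it keeps individual uploaded shards small.
    # "500MB" is an example value, not a recommended default.
    ds.push_to_hub(repo_id, max_shard_size="500MB")
```

In practice `ds` would be the result of `load_dataset(..., streaming=True)`.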
✨ New Features
- Added ability to read and write data directly from/to Hugging Face Storage Buckets using `buckets/username/data-bucket` or `hf://buckets/username/data-bucket/*.jsonl` paths.
- Added `max_shard_size` parameter to `IterableDataset.push_to_hub`.
- Improved Arrow-native operations for `IterableDataset`.
- Enhanced support for glob patterns within archives, such as `zip://*.jsonl::hf://datasets/username/dataset-name/data.zip`.
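To make the path formats above concrete, here is a small sketch that assembles the two kinds of URIs. The helper names are ours; only the URI shapes come from the release notes:

```python
def bucket_uri(username: str, bucket: str, pattern: str = "") -> str:
    """Build an hf://buckets/... URI for a Hugging Face Storage Bucket."""
    base = f"hf://buckets/{username}/{bucket}"
    return f"{base}/{pattern}" if pattern else base


def zip_glob_uri(inner_glob: str, outer_uri: str) -> str:
    """Chain a glob inside a zip archive onto a remote file URI."""
    return f"zip://{inner_glob}::{outer_uri}"


# Such URIs can then be passed as data_files, e.g.:
#   load_dataset("json", data_files=bucket_uri("username", "data-bucket", "*.jsonl"))
print(bucket_uri("username", "data-bucket", "*.jsonl"))
# hf://buckets/username/data-bucket/*.jsonl
print(zip_glob_uri("*.jsonl", "hf://datasets/username/dataset-name/data.zip"))
# zip://*.jsonl::hf://datasets/username/dataset-name/data.zip
```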
🐛 Bug Fixes
- Fixed multiprocessed `push_to_hub` on macOS by switching from fork to spawn, resolving segfaults.
- Fixed `reshard_data_sources` functionality.
- Improved error message when an invalid `data_files` pattern format is provided.
- Fixed null filling behavior for missing columns in JSONL files.
- Fixed issues related to `to_pandas`, `videofolder`, and `load_dataset_builder` kwargs when using streaming iterables.
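The macOS fix above swaps the multiprocessing start method from fork to spawn. As a general illustration (plain standard-library Python, not the library's internal code), selecting the start method explicitly looks like this:

```python
import multiprocessing as mp


def square(x: int) -> int:
    return x * x


def run_pool() -> list:
    # "spawn" starts fresh interpreter processes instead of forking,
    # which avoids the fork-related crashes seen on macOS.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        return pool.map(square, [1, 2, 3])


if __name__ == "__main__":
    print(run_pool())  # [1, 4, 9]
```

Note the `__main__` guard: with spawn, worker processes re-import the main module, so unguarded top-level work would run again in every worker.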