Changelog

4.8.0

📦 datasets

Summary

This release introduces native support for reading and writing datasets in Hugging Face Storage Buckets, brings significant improvements and fixes to streaming `IterableDataset` objects, and resolves a segfault on macOS during multiprocessed `push_to_hub`.

Migration Steps

  1. When calling `push_to_hub` on a streaming dataset, set the new `max_shard_size` parameter if you need to control the size of uploaded shards.
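A minimal sketch of that step, assuming a placeholder repo id and shard cap (neither appears in the release notes):

```python
# Hedged sketch: push a streaming dataset in size-bounded shards.
# The repo id "username/my-dataset" and the "500MB" cap are placeholders.
# The calls need network access and authentication, so they stay commented:
#
# from datasets import load_dataset
#
# ds = load_dataset("json", data_files="data/*.jsonl", streaming=True)
# ds.push_to_hub("username/my-dataset", max_shard_size="500MB")

shard_cap = "500MB"  # each uploaded shard is kept under roughly this size
```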

✨ New Features

  • Added ability to read and write data directly from/to Hugging Face Storage Buckets using `buckets/username/data-bucket` or `hf://buckets/username/data-bucket/*.jsonl` paths.
  • Added `max_shard_size` parameter to `IterableDataset.push_to_hub`.
  • Improved Arrow-native operations for `IterableDataset`.
  • Enhanced support for glob patterns within archives, such as `zip://*.jsonl::hf://datasets/username/dataset-name/data.zip`.
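A rough sketch of the new path styles described above (bucket and repo names are placeholders, and the exact bucket API may differ):

```python
# Placeholder paths illustrating the new bucket scheme and archive globbing.
bucket_files = "hf://buckets/username/data-bucket/*.jsonl"
archive_files = "zip://*.jsonl::hf://datasets/username/dataset-name/data.zip"

# Loading either pattern requires network access, so the calls are illustrative:
#
# from datasets import load_dataset
#
# ds = load_dataset("json", data_files=bucket_files, streaming=True)
# ds = load_dataset("json", data_files=archive_files, streaming=True)
```

The `zip://…::hf://…` form is fsspec-style URL chaining: the part after `::` names the remote archive, and the glob before it selects members inside it.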

🐛 Bug Fixes

  • Fixed multiprocessed `push_to_hub` on macOS by switching from fork to spawn, resolving segfaults.
  • Fixed `reshard_data_sources` functionality.
  • Improved error message when an invalid `data_files` pattern format is provided.
  • Fixed null filling behavior for missing columns in JSONL files.
  • Fixed issues related to `to_pandas`, `videofolder`, and `load_dataset_builder` kwargs when using streaming iterables.

Affected Symbols