3.4.0

Breaking Changes

📅 Mar 14, 2025 · 📦 datasets

⚠ 1 breaking · ✨ 4 features · 🐛 4 fixes · 🔧 6 symbols

Summary

This release speeds up folder-based dataset building, adds support for Parquet metadata files, and adds multithreaded decoding for faster streaming through `IterableDataset.decode`. The one breaking change replaces `decord` with `torchvision` for video loading.

⚠️ Breaking Changes

Folder-based builders now read videos with `torchvision` instead of `decord`, because `decord` is unmaintained and incompatible with recent Python versions. Update any video loading logic that depends on `decord` to use `torchvision`.

Migration Steps

If your code used `decord` for video loading, switch to `torchvision`. See the video dataset loading documentation for details on the new loading mechanism.
For faster streaming of image/audio/video folders from Hugging Face, consider using the new `dataset.decode(num_threads=num_threads)` method.
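
The streaming suggestion above can be sketched as follows. This is a minimal sketch, assuming `datasets` 3.4.0 or later is installed. The helper name `stream_decoded`, the repository id, and the default thread count are placeholders for illustration and are not part of the release.

```python
from typing import Any, Iterator


def stream_decoded(
    repo_id: str, num_threads: int = 8, split: str = "train"
) -> Iterator[dict[str, Any]]:
    """Yield examples from a streamed media dataset, decoding with several threads."""
    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an IterableDataset instead of downloading everything.
    dataset = load_dataset(repo_id, split=split, streaming=True)
    # `decode(num_threads=...)` is the method named in these notes.
    dataset = dataset.decode(num_threads=num_threads)
    yield from dataset
```

Nothing is downloaded until the generator is iterated. For example, `next(stream_decoded("username/my-image-folder"))` would fetch and decode the first example of that (hypothetical) repository. A suitable `num_threads` value depends on the machine and the media type.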

✨ New Features

Faster folder-based builder implementation with added support for Parquet metadata files (`metadata.parquet`) alongside existing CSV/JSONL formats.
Support for repeated media files when building datasets from folders.
Added `IterableDataset.decode` method with optional `num_threads` argument for multithreaded decoding of image/audio/video data during streaming.
Added `with_split` argument to `DatasetDict.map`.
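
The `with_split` argument listed above might be used as in the sketch below. It assumes `datasets` 3.4.0 or later, and it assumes that the split name is passed to the mapped function as the last positional argument. The names `tag_with_split` and `demo` and the toy data are illustrative only.

```python
from typing import Any


def tag_with_split(example: dict[str, Any], split: str) -> dict[str, Any]:
    """Return a copy of `example` with the split name stored in a new column."""
    # Assumption: with `with_split=True`, the split name arrives as the
    # last positional argument, after the example itself.
    return {**example, "split": split}


def demo() -> None:
    """Apply `tag_with_split` to a tiny in-memory DatasetDict."""
    # Imported lazily so `tag_with_split` can be used without `datasets` installed.
    from datasets import Dataset, DatasetDict

    splits = DatasetDict(
        {
            "train": Dataset.from_dict({"text": ["a", "b"]}),
            "test": Dataset.from_dict({"text": ["c"]}),
        }
    )
    # Each split is mapped separately; the function also receives its split name.
    tagged = splits.map(tag_with_split, with_split=True)
    print(tagged["train"][0])
    print(tagged["test"][0])
```

Calling `demo()` prints one example per split, each carrying the name of the split it came from. This is useful when preprocessing should differ between, say, training and evaluation data.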

🐛 Bug Fixes

Fixed a typing error that occurred when loading datasets with a boolean type that had a `None` default value.
Refactored `string_to_dict` to return `None` instead of raising a `ValueError` when no match is found.
Fixed small bugs related to asynchronous mapping operations.
Fixed resuming functionality after calling `ds.set_epoch(new_epoch)`.
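
The `string_to_dict` change above means callers now check for `None` rather than catching `ValueError`. The sketch below illustrates that contract with a simplified stand-in written for these notes. `parse_with_pattern` is a hypothetical helper, not the library's implementation, and it supports only plain `{name}` placeholders.

```python
import re
from typing import Optional


def parse_with_pattern(string: str, pattern: str) -> Optional[dict[str, str]]:
    """Extract the `{name}` placeholders of `pattern` from `string`.

    Returns None when the string does not match, instead of raising.
    """
    # Split the pattern into literal text (even indices) and
    # placeholder names (odd indices).
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, string)
    if match is None:
        return None  # no match: signal with None rather than ValueError
    return match.groupdict()


fields = parse_with_pattern("data/train-0001.parquet", "data/{split}-{shard}.parquet")
if fields is None:
    print("file name does not follow the pattern")
else:
    print(fields["split"], fields["shard"])
```

Code that previously wrapped the call in `try`/`except ValueError` would need the `None` check instead, as in the last lines of the sketch.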

🔧 Affected Symbols

`decord`, `torchvision`, `IterableDataset.decode`, `DatasetDict.map`, `string_to_dict`, `Video`