3.4.0
📦 datasets
⚠ 1 breaking · ✨ 4 features · 🐛 4 fixes · 🔧 6 symbols
Summary
This release speeds up folder-based dataset building, adds support for Parquet metadata files, and adds multithreaded decoding for streaming through the new `IterableDataset.decode` method. The one breaking change: folder-based builders now load videos with `torchvision` instead of `decord`.
⚠️ Breaking Changes
- Folder-based builders now read videos with `torchvision` instead of `decord`, because `decord` is unmaintained and incompatible with recent Python versions. Video loading logic that depends on `decord` must be updated to use `torchvision`.
Migration Steps
- If your code relied on `decord` for video loading, switch it to `torchvision`. See the video dataset loading documentation for details on the new loading mechanism.
- For faster streaming of image/audio/video folders from Hugging Face, consider using the new `dataset.decode(num_threads=num_threads)` method.
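
As a rough illustration of the streaming tip above, the sketch below wraps `load_dataset(..., streaming=True)` and `decode(num_threads=...)` in a small helper. The names `stream_decoded` and `default_num_threads`, the `repo_id` argument, and the thread-count heuristic (CPU count plus four, capped at 32) are assumptions made for this example, not recommendations from the release. It also assumes `datasets` 3.4.0 or later is installed.

```python
import os
from typing import Optional


def default_num_threads(cap: int = 32) -> int:
    """Pick a thread count for I/O-bound decoding: CPU count + 4, capped."""
    return min(cap, (os.cpu_count() or 1) + 4)


def stream_decoded(repo_id: str, split: str = "train", num_threads: Optional[int] = None):
    """Return a streaming dataset whose media columns are decoded in threads.

    `repo_id` is a placeholder for an image/audio/video folder dataset on the Hub.
    """
    # Imported lazily so default_num_threads() works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(repo_id, split=split, streaming=True)
    # Fall back to the heuristic above when no explicit thread count is given.
    return ds.decode(num_threads=num_threads or default_num_threads())
```

Calling `stream_decoded("username/my-image-folder")` (a placeholder repository name) and iterating over the result should yield examples with their media already decoded. The best `num_threads` value depends on your network and CPU, so it is worth measuring rather than trusting the heuristic.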
✨ New Features
- Faster folder-based builder implementation with added support for Parquet metadata files (`metadata.parquet`) alongside existing CSV/JSONL formats.
- Support for repeated media files when building datasets from folders.
- Added `IterableDataset.decode` method with optional `num_threads` argument for multithreaded decoding of image/audio/video data during streaming.
- Added `with_split` argument to `DatasetDict.map`.
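
The Parquet metadata support can be sketched as follows, assuming `metadata.parquet` follows the same convention as the CSV/JSONL metadata files: a `file_name` column holding each media file's path relative to the folder. The helper names, the `caption` column, and the suffix filter are illustrative only. Writing the file assumes `pandas` with a Parquet engine such as `pyarrow` is available.

```python
from pathlib import Path
from typing import Dict, List

# Illustrative filter; extend it for audio or video folders.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_metadata_rows(folder, captions: Dict[str, str]) -> List[Dict[str, str]]:
    """Build one metadata row per image file found under `folder`.

    `file_name` is stored relative to `folder`, using forward slashes.
    """
    folder = Path(folder)
    rows = []
    for path in sorted(folder.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            rel = path.relative_to(folder).as_posix()
            # Files without a caption get an empty string (an assumption).
            rows.append({"file_name": rel, "caption": captions.get(rel, "")})
    return rows


def write_metadata_parquet(folder, rows: List[Dict[str, str]]) -> Path:
    """Write `rows` to `<folder>/metadata.parquet` and return its path."""
    # Imported lazily: needs pandas plus a Parquet engine such as pyarrow.
    import pandas as pd

    out = Path(folder) / "metadata.parquet"
    pd.DataFrame(rows).to_parquet(out, index=False)
    return out
```

With the file in place, a call such as `load_dataset("imagefolder", data_dir=folder)` would be expected to pick up the extra columns, as it does for `metadata.csv` and `metadata.jsonl`.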
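
For the `with_split` argument, a minimal sketch might look like the following. It assumes that with `with_split=True` the split name is passed to the mapped function as an extra positional argument after the example, and that the dataset has a `text` column. Both should be checked against your installed version and data. The function names are hypothetical.

```python
from typing import Any, Dict


def tag_with_split(example: Dict[str, Any], split: str) -> Dict[str, Any]:
    """Per-example function; `split` is supplied by `map` when with_split=True."""
    # Returned keys are added as (or overwrite) columns in each split.
    return {"source_split": split, "n_chars": len(example["text"])}


def add_split_column(dataset_dict):
    """Apply `tag_with_split` to every split of a `datasets.DatasetDict`."""
    return dataset_dict.map(tag_with_split, with_split=True)
```

This can be handy when the same preprocessing function needs to behave slightly differently per split, for example applying augmentation only to `train`.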
🐛 Bug Fixes
- Fixed a typing error that occurred when loading datasets with a boolean type that had a `None` default value.
- Refactored `string_to_dict` to return `None` instead of raising a `ValueError` when no match is found.
- Fixed small bugs related to asynchronous mapping operations.
- Fixed resuming functionality after calling `ds.set_epoch(new_epoch)`.
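
The `string_to_dict` change affects how callers detect a failed match: code that previously caught `ValueError` would now check for `None`. The sketch below is a simplified stand-in written for this note, not the library's implementation. The name `string_to_dict_sketch` is hypothetical, and the real helper's signature and matching rules may differ.

```python
import re
from typing import Dict, Optional


def string_to_dict_sketch(string: str, pattern: str) -> Optional[Dict[str, str]]:
    """Extract `{name}` placeholders of `pattern` from `string`.

    Returns None (rather than raising) when `string` does not fit `pattern`.
    """
    # re.split with a capture group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, string)
    return match.groupdict() if match else None


# Caller side: test for None instead of wrapping the call in try/except.
fields = string_to_dict_sketch("README.md", "{split}-{shard}.parquet")
if fields is None:
    fields = {}
```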
🔧 Affected Symbols
`decord`, `torchvision`, `IterableDataset.decode`, `DatasetDict.map`, `string_to_dict`, `Video`