
3.4.0

Breaking Changes
📦 datasets
1 breaking · 4 features · 🐛 4 fixes · 🔧 6 symbols

Summary

This release speeds up folder-based dataset building, adds support for Parquet metadata files, and enables faster streaming through multithreaded decoding with `IterableDataset.decode`. The one breaking change replaces `decord` with `torchvision` for video loading.

⚠️ Breaking Changes

  • Replaced `decord` with `torchvision` for reading videos in folder-based builders, because `decord` is no longer maintained and is incompatible with recent Python versions. Users should update their video loading logic to use `torchvision`.

Migration Steps

  1. If your code relied on `decord` for video loading, update it to work with `torchvision`, which now replaces it. See the video dataset loading documentation for details on the new loading mechanism.
  2. For faster streaming of image/audio/video folders from Hugging Face, consider using the new `dataset.decode(num_threads=num_threads)` method.
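
As a rough sketch of step 2, the helper below wraps the `decode(num_threads=...)` call named above. `take_decoded` and its parameters are hypothetical names introduced here for illustration, the repository id in the commented usage is a placeholder, and the sketch assumes `datasets` 3.4.0 or later.

```python
from itertools import islice


def take_decoded(streaming_dataset, n=4, num_threads=8):
    """Return the first n examples with multithreaded media decoding enabled.

    streaming_dataset is expected to be a datasets.IterableDataset
    (datasets >= 3.4.0), e.g. from load_dataset(..., streaming=True).
    """
    # decode() returns a dataset with decoding enabled; num_threads sets
    # how many threads decode image/audio/video data while streaming.
    decoded = streaming_dataset.decode(num_threads=num_threads)
    return list(islice(decoded, n))


# Usage (needs network access; "username/my-media-folder" is a placeholder):
# from datasets import load_dataset
# ds = load_dataset("username/my-media-folder", split="train", streaming=True)
# examples = take_decoded(ds, n=4, num_threads=8)
```

A sensible thread count depends on the media type and the machine, so it is worth measuring a few values rather than assuming more threads are always faster.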

✨ New Features

  • Faster folder-based builder implementation with added support for Parquet metadata files (`metadata.parquet`) alongside existing CSV/JSONL formats.
  • Support for repeated media files when building datasets from folders.
  • Added `IterableDataset.decode` method with optional `num_threads` argument for multithreaded decoding of image/audio/video data during streaming.
  • Added `with_split` argument to `DatasetDict.map`.
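
The `with_split` argument can be sketched as follows. This assumes that, with `with_split=True`, the split name is passed to the mapped function as an extra positional argument after the example; `tag_with_split` and `build_tagged_splits` are hypothetical names used only for this illustration.

```python
def tag_with_split(example, split):
    """Return a copy of the example with the name of its split recorded."""
    return {**example, "split": split}


def build_tagged_splits():
    # Imported lazily so tag_with_split can be reused without datasets installed.
    from datasets import Dataset, DatasetDict

    splits = DatasetDict(
        {
            "train": Dataset.from_dict({"text": ["a", "b"]}),
            "test": Dataset.from_dict({"text": ["c"]}),
        }
    )
    # Assumption: with_split=True hands the split name ("train" or "test")
    # to tag_with_split as its second argument (requires datasets >= 3.4.0).
    return splits.map(tag_with_split, with_split=True)
```

Calling `build_tagged_splits()` should yield a `DatasetDict` whose examples each carry a `split` column, which can help when the same preprocessing function needs to behave differently for training and evaluation data.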

🐛 Bug Fixes

  • Fixed a typing error that occurred when loading datasets with a boolean type that had a `None` default value.
  • Refactored `string_to_dict` to return `None` instead of raising a `ValueError` when no match is found.
  • Fixed small bugs related to asynchronous mapping operations.
  • Fixed resuming functionality after calling `ds.set_epoch(new_epoch)`.
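
To illustrate the `string_to_dict` change, here is a simplified stand-in written with the standard library only. It is not the library's implementation; `string_to_dict_sketch` is a hypothetical name, and the `{name}` placeholder syntax is an assumption made for the example. The point is the contract: a non-matching string yields `None` rather than a `ValueError`.

```python
import re
from typing import Dict, Optional


def string_to_dict_sketch(string: str, pattern: str) -> Optional[Dict[str, str]]:
    """Extract {name} placeholders from string, or return None if nothing matches."""
    # Splitting on "{name}" gives alternating parts: literal, name, literal, ...
    parts = re.split(r"\{(.+?)\}", pattern)
    keys = parts[1::2]
    # Escape the literal parts and turn each placeholder into a capture group.
    regex = "".join(
        re.escape(part) if i % 2 == 0 else "(.+)" for i, part in enumerate(parts)
    )
    match = re.search(regex, string)
    if match is None:
        # No match: return None instead of raising ValueError.
        return None
    return dict(zip(keys, match.groups()))


parsed = string_to_dict_sketch("train-0001.parquet", "{split}-{index}.parquet")
missing = string_to_dict_sketch("README.md", "{split}-{index}.parquet")
```

With this contract, callers check for `None` instead of wrapping the call in `try`/`except`.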

🔧 Affected Symbols

`decord`, `torchvision`, `IterableDataset.decode`, `DatasetDict.map`, `string_to_dict`, `Video`