4.0.0

Breaking Changes
📦 datasets
3 breaking · 4 features · 🐛 22 fixes · 1 deprecation · 🔧 10 symbols

Summary

This release adds `push_to_hub` for streaming datasets (`IterableDataset`) and lazy column access through the new `Column` object. It also removes dataset scripts, replaces the legacy `Sequence` feature type with the new `List` type, and switches the audio/video decoding backend to `torchcodec`.

⚠️ Breaking Changes

  • Removed scripts entirely; `trust_remote_code` is no longer supported.
  • Torchcodec replaces `soundfile` for audio decoding and `decord` for video decoding.
  • The `Sequence` type is replaced by `List`. `Sequence` is now a utility that returns a `List` or a `dict` depending on the subfeature structure, instead of being a feature type itself.
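To gauge how much of an existing codebase the first and third changes touch, a small scanner can flag the affected calls. The following is a minimal sketch using only the standard library; the helper name `find_migration_sites` and the repository ID `user/repo` are hypothetical, and matching is by identifier only, so an unrelated function that happens to be called `Sequence` would also be reported.

```python
import ast


def find_migration_sites(source: str):
    """Return sorted (line, kind) pairs for calls that datasets 4.0.0 affects."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Matches both `Sequence(...)` and `datasets.Sequence(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name == "Sequence":
            sites.append((node.lineno, "Sequence"))
        # Any call still passing trust_remote_code=... needs attention.
        if any(kw.arg == "trust_remote_code" for kw in node.keywords):
            sites.append((node.lineno, "trust_remote_code"))
    return sorted(sites)


# Hypothetical pre-4.0.0 code to scan (the repo ID is a placeholder).
example = '''
from datasets import Features, Sequence, Value, load_dataset

features = Features({"tokens": Sequence(Value("string"))})
ds = load_dataset("user/repo", trust_remote_code=True)
'''

print(find_migration_sites(example))
# → [(4, 'Sequence'), (5, 'trust_remote_code')]
```

The scanner only reports locations; whether a given `Sequence(...)` becomes a `List` or a `dict` still depends on its subfeature structure, so each site is best reviewed by hand.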

Migration Steps

  1. If you relied on `trust_remote_code=True`, remove the argument: the option no longer exists and dataset scripts are no longer run, so any loading logic a script provided must now be executed in your own code.
  2. Replace usage of `Sequence(Value("string"))` with `List(Value("string"))` in feature definitions.
  3. If using audio/video decoding, note that `torchcodec` is now the default backend, requiring `torch>=2.7.0` and FFmpeg >= 4. Windows support for torchcodec decoding is pending.
  4. If you were using `soundfile` or `decord` directly for decoding, be aware they are replaced by `torchcodec`.
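The version requirements in step 3 can be checked before upgrading. This is a rough preflight sketch under two assumptions that are not part of the release notes: that the `ffmpeg` executable on `PATH` reflects the FFmpeg libraries `torchcodec` will actually load, and that its version banner starts with `ffmpeg version <number>` (git builds with other banners are reported as not satisfying the check). The helper names are illustrative.

```python
import re
import shutil
import subprocess
from importlib import metadata


def version_tuple(text: str):
    """Leading dotted number as a 3-tuple, e.g. '2.7.1+cu126' -> (2, 7, 1)."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    parts = tuple(int(p) for p in match.group().split(".")) if match else ()
    return (parts + (0, 0, 0))[:3]  # pad short versions such as '2.7'


def torch_ok(minimum=(2, 7, 0)) -> bool:
    """True if an installed torch distribution meets the minimum version."""
    try:
        return version_tuple(metadata.version("torch")) >= minimum
    except metadata.PackageNotFoundError:
        return False


def ffmpeg_major(banner: str):
    """Major version from an `ffmpeg -version` banner, or None if unparseable."""
    match = re.match(r"ffmpeg version n?(\d+)", banner)
    return int(match.group(1)) if match else None


def ffmpeg_ok(minimum_major: int = 4) -> bool:
    """True if the ffmpeg executable on PATH reports a high enough major version."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return False
    banner = subprocess.run([exe, "-version"], capture_output=True, text=True).stdout
    major = ffmpeg_major(banner)
    return major is not None and major >= minimum_major


print("torch >= 2.7.0:", torch_ok())
print("FFmpeg >= 4:", ffmpeg_ok())
```

A `False` result here is a prompt to investigate rather than proof of incompatibility, since the check is a heuristic.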

✨ New Features

  • Added `IterableDataset.push_to_hub()` for building streaming data pipelines.
  • Added `num_proc=` argument to `.push_to_hub()` for both `Dataset` and `IterableDataset` to speed up uploads.
  • Introduced the `Column` object, enabling iteration over column values in `IterableDataset` and lazy loading of single cells (e.g., `ds["text"]`).
  • Implemented Torchcodec decoding, enabling streaming only required ranges/frames for audio/video data.
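The column iteration and parallel upload features above might be used along these lines. This is a duck-typed sketch, not the library's own code: `FakeDataset`, `preview_column`, `upload`, and the repository ID `user/repo` are hypothetical stand-ins so the example runs without `datasets` installed. With the library, `ds` would be a real `Dataset` or `IterableDataset`, and the only calls relied on are `ds["column"]` and `push_to_hub(..., num_proc=...)` as named in these notes.

```python
from itertools import islice


def preview_column(ds, name: str, n: int = 3):
    """Return the first n values of one column without reading the rest."""
    # ds[name] is expected to be iterable (a Column in datasets 4.0.0),
    # so islice stops after n items instead of consuming the whole column.
    return list(islice(ds[name], n))


def upload(ds, repo_id: str, num_proc: int = 4):
    """Push a dataset to the Hub using several upload workers."""
    return ds.push_to_hub(repo_id, num_proc=num_proc)


class FakeDataset:
    """Hypothetical stand-in that mimics the two calls used above."""

    def __init__(self, columns):
        self.columns = columns

    def __getitem__(self, name):
        return iter(self.columns[name])

    def push_to_hub(self, repo_id, num_proc=None):
        return f"pushed to {repo_id} with num_proc={num_proc}"


ds = FakeDataset({"text": ["first", "second", "third", "fourth"]})
print(preview_column(ds, "text", n=2))  # → ['first', 'second']
print(upload(ds, "user/repo", num_proc=2))
```

Keeping the helpers independent of the concrete dataset class means the same code path can serve both in-memory and streaming datasets.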

🐛 Bug Fixes

  • Refactored `Dataset.map` to reuse cache files mapped with different `num_proc` values.
  • Fixed string_to_dict test.
  • Preserved formatting in concatenated `IterableDataset`.
  • Fixed typos in PDF and Video documentation.
  • Added `embed_storage` in Pdf feature.
  • Fixed typing for `load_dataset` splits.
  • Fixed typos.
  • Fixed regex library warnings.
  • Fixed string_to_dict usage for Windows.
  • Removed TensorFlow tests on Windows.
  • Fixed parallel `push_to_hub` in `DatasetDict`.
  • Updated `_dill.py` to use `co_linetable` instead of `co_lnotab` for Python 3.10+.
  • Fixed various documentation issues.
  • Added Albumentations integration.
  • Raised error in `FolderBasedBuilder` when both `data_dir` and `data_files` are missing.
  • Fixed `save_infos`.
  • Improved features representation (`repr`).
  • Fixed length calculation for CI.
  • Fixed CI issues related to sequence handling.
  • Fixed inferring list of images.
  • Fixed audio bytes handling.
  • Fixed double sequence handling.

🔧 Affected Symbols

  • `IterableDataset.push_to_hub`
  • `Dataset.push_to_hub`
  • `Column`
  • `IterableColumn`
  • `datasets.Sequence`
  • `datasets.List`
  • `torchcodec`
  • `soundfile`
  • `decord`
  • `Dataset.map`

⚡ Deprecations

  • The `Sequence` feature type is deprecated in favor of `List`.