4.0.0

Breaking Changes

📅 Jul 9, 2025 📦 datasets

⚠ 3 breaking ✨ 4 features 🐛 22 fixes ⚡ 1 deprecation 🔧 10 symbols

Summary

This release introduces significant new features like `push_to_hub` for streaming datasets and lazy column access via the new `Column` object. It also mandates a migration away from the legacy `Sequence` type to the new `List` type and updates audio/video decoding backends to use `torchcodec`.

⚠️ Breaking Changes

Dataset loading scripts are removed entirely; `trust_remote_code` is no longer supported.
`torchcodec` replaces `soundfile` for audio decoding and `decord` for video decoding.
The `Sequence` type is replaced by `List`. `Sequence` is now a utility that returns a `List` or a `dict` depending on the subfeature structure, instead of being a feature type itself.

Migration Steps

If you relied on `trust_remote_code=True`, remove the argument. The library no longer runs dataset loading scripts, so any custom loading code must be run on your side, outside the library.
Replace usage of `Sequence(Value("string"))` with `List(Value("string"))` in feature definitions.
Audio and video decoding now uses `torchcodec` by default, which requires `torch>=2.7.0` and FFmpeg >= 4. Windows support for `torchcodec` decoding is pending.
If you were using `soundfile` or `decord` directly for decoding, be aware they are replaced by `torchcodec`.
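
Before migrating a decoding pipeline, it can help to check the environment against the requirements listed above. The following is a rough sketch using only the standard library. The helpers `version_tuple` and `decoding_requirements_report` are hypothetical names, not part of `datasets` or `torchcodec`. Finding an `ffmpeg` executable on `PATH` is only a proxy: it does not verify the FFmpeg version or that the shared libraries are present.

```python
import re
import shutil
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn a string such as '2.7.0+cu126' into (2, 7, 0), ignoring suffixes."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def decoding_requirements_report() -> dict:
    """Hypothetical helper: report which stated requirements look satisfied."""
    report = {}
    try:
        installed = version_tuple(metadata.version("torch"))
        report["torch>=2.7.0"] = installed >= (2, 7, 0)
    except metadata.PackageNotFoundError:
        report["torch>=2.7.0"] = False
    try:
        metadata.version("torchcodec")
        report["torchcodec installed"] = True
    except metadata.PackageNotFoundError:
        report["torchcodec installed"] = False
    # Proxy check only: the FFmpeg version itself is not inspected here.
    report["ffmpeg on PATH"] = shutil.which("ffmpeg") is not None
    return report


print(decoding_requirements_report())
```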

✨ New Features

Added `IterableDataset.push_to_hub()` for building streaming data pipelines.
Added `num_proc=` argument to `.push_to_hub()` for both `Dataset` and `IterableDataset` to speed up uploads.
Introduced the `Column` object: indexing a dataset by column name (e.g., `ds["text"]`) now returns a lazy column whose cells are loaded individually on access, and the values of a column in an `IterableDataset` can be iterated over.
Implemented `torchcodec` decoding, which streams only the required ranges or frames of audio/video data.

🐛 Bug Fixes

Refactored `Dataset.map` to reuse cache files mapped with different `num_proc` values.
Fixed the `string_to_dict` test.
Preserved formatting in concatenated `IterableDataset`.
Fixed typos in PDF and Video documentation.
Added `embed_storage` to the `Pdf` feature.
Fixed typing for `load_dataset` splits.
Fixed typos.
Fixed regex library warnings.
Fixed `string_to_dict` usage on Windows.
Removed TensorFlow tests on Windows.
Fixed parallel `push_to_hub` in `DatasetDict`.
Updated `_dill.py` to use `co_linetable` instead of `co_lnotab` for Python 3.10+.
Fixed various documentation issues.
Added Albumentations integration.
Raised error in `FolderBasedBuilder` when both `data_dir` and `data_files` are missing.
Fixed `save_infos`.
Improved features representation (`repr`).
Fixed length calculation for CI.
Fixed CI issues related to sequence handling.
Fixed inferring list of images.
Fixed audio bytes handling.
Fixed double sequence handling.

🔧 Affected Symbols

`IterableDataset.push_to_hub`
`Dataset.push_to_hub`
`Column`
`IterableColumn`
`datasets.Sequence`
`datasets.List`
`torchcodec`
`soundfile`
`decord`
`Dataset.map`

⚡ Deprecations

The `Sequence` feature type is deprecated in favor of `List`.