4.0.0
📦 datasets · Breaking Changes
⚠ 3 breaking · ✨ 4 features · 🐛 22 fixes · ⚡ 1 deprecation · 🔧 10 symbols
Summary
This release introduces significant new features like `push_to_hub` for streaming datasets and lazy column access via the new `Column` object. It also mandates a migration away from the legacy `Sequence` type to the new `List` type and updates audio/video decoding backends to use `torchcodec`.
⚠️ Breaking Changes
- Removed scripts entirely; `trust_remote_code` is no longer supported.
- Torchcodec replaces `soundfile` for audio decoding and `decord` for video decoding.
- The `Sequence` type is replaced by `List`. `Sequence` is now a utility that returns a `List` or a `dict` depending on the subfeature structure, instead of being a feature type itself.
Migration Steps
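- If you relied on `trust_remote_code=True`, note that the option has been removed and dataset loading scripts are no longer executed. Run any custom loading code yourself, locally, before passing the data to `datasets`.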
- Replace usage of `Sequence(Value("string"))` with `List(Value("string"))` in feature definitions.
- If using audio/video decoding, note that `torchcodec` is now the default backend, requiring `torch>=2.7.0` and FFmpeg >= 4. Windows support for torchcodec decoding is pending.
- If you were using `soundfile` or `decord` directly for decoding, be aware they are replaced by `torchcodec`.
✨ New Features
- Added `IterableDataset.push_to_hub()` for building streaming data pipelines.
- Added `num_proc=` argument to `.push_to_hub()` for both `Dataset` and `IterableDataset` to speed up uploads.
- Introduced the `Column` object: indexing a dataset by column name (e.g., `ds["text"]`) now returns a lazy column, which enables iterating over a column's values in an `IterableDataset` and loading single cells on demand.
- Implemented Torchcodec decoding, enabling streaming only required ranges/frames for audio/video data.
🐛 Bug Fixes
- Refactored `Dataset.map` to reuse cache files mapped with different `num_proc` values.
- Fixed `string_to_dict` test.
- Preserved formatting in concatenated `IterableDataset`.
- Fixed typos in PDF and Video documentation.
- Added `embed_storage` in Pdf feature.
- Fixed typing for `load_dataset` splits.
- Fixed typos.
- Fixed regex library warnings.
- Fixed `string_to_dict` usage for Windows.
- Removed TensorFlow tests on Windows.
- Fixed parallel `push_to_hub` in `DatasetDict`.
- Updated `_dill.py` to use `co_linetable` instead of `co_lnotab` for Python 3.10+.
- Fixed various documentation issues.
- Added Albumentations integration.
- Raised error in `FolderBasedBuilder` when both `data_dir` and `data_files` are missing.
- Fixed `save_infos`.
- Improved features representation (`repr`).
- Fixed length calculation for CI.
- Fixed CI issues related to sequence handling.
- Fixed inferring list of images.
- Fixed audio bytes handling.
- Fixed double sequence handling.
🔧 Affected Symbols
- `IterableDataset.push_to_hub`
- `Dataset.push_to_hub`
- `Column`
- `IterableColumn`
- `datasets.Sequence`
- `datasets.List`
- `torchcodec`
- `soundfile`
- `decord`
- `Dataset.map`
⚡ Deprecations
- The `Sequence` feature type is deprecated in favor of `List`.