4.1.0
📦 datasets
✨ 6 features · 🐛 18 fixes · 🔧 5 symbols
Summary
This release introduces significant performance improvements via content-defined chunking for Parquet files and adds native support for loading HDF5 datasets. It also brings concurrent upload capabilities and various bug fixes across audio handling and dataset processing.
Migration Steps
- If you rely on specific behavior of `num_proc=1`, note that it now uses one worker process instead of the main process. If you intended to run processing sequentially in the main process, use `num_proc=None` (or omit the argument); see the sketch after this list.
- If you are using audio features, be aware that encoding now relies on `TorchCodec` instead of `Soundfile`.
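A minimal sketch of the `num_proc` change for `map` (the dataset and column names are illustrative):

```python
from datasets import load_dataset

ds = load_dataset("imdb", split="train")

def add_length(example):
    example["n_chars"] = len(example["text"])
    return example

# Runs sequentially in the main process (also the default when num_proc is omitted):
ds = ds.map(add_length, num_proc=None)

# Now spawns a single worker process instead of running in the main process:
ds = ds.map(add_length, num_proc=1)
```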
✨ New Features
- Enabled content-defined chunking when writing Parquet files, which optimizes Parquet for the Xet storage backend by defining page boundaries based on data content, making deduplication easier (see the Parquet sketch after this list).
- Introduced concurrent `push_to_hub` functionality for datasets.
- Added concurrent `push_to_hub` support for `IterableDataset`.
- Added support for loading HDF5 datasets directly via `load_dataset` (see the HDF5 sketch after this list).
- Audio encoding now uses `TorchCodec` instead of `Soundfile` (see the audio sketch after this list).
- Added support for `pathlib.Path` as input for features (see the `pathlib.Path` sketch after this list).
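Content-defined chunking applies when the library writes Parquet, for example during an upload, so Xet deduplication benefits show up without code changes; a minimal sketch (the dataset and repo id are placeholders):

```python
from datasets import load_dataset

ds = load_dataset("imdb", split="train")

# Parquet shards written for the upload are now chunked by content,
# so unchanged byte ranges can deduplicate against earlier uploads
# on Xet-backed repositories.
ds.push_to_hub("username/my-dataset")
```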
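A hedged sketch of the new HDF5 loading path, assuming the packaged builder is exposed under the name `hdf5` and that `path/to/data.h5` is a local file (both names are assumptions, not confirmed by these notes):

```python
from datasets import load_dataset

# Load a local HDF5 file; nested groups are kept as nested features
# (see the tree-structure fix in the bug list below).
ds = load_dataset("hdf5", data_files="path/to/data.h5", split="train")
print(ds.features)
```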
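A minimal sketch of where the encoding change is felt: storing raw arrays in an `Audio` column (the waveform values are illustrative; `torchcodec` must be installed for the encoding step):

```python
import numpy as np
from datasets import Audio, Dataset

# A one-second 440 Hz sine wave as a raw array plus its sampling rate.
wave = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16_000))
ds = Dataset.from_dict({"audio": [{"array": wave, "sampling_rate": 16_000}]})

# Casting encodes the raw array into stored audio bytes; this step now
# goes through TorchCodec rather than Soundfile.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
```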
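A small sketch of `pathlib.Path` inputs for features, assuming an `Image` column and a local file path (both are illustrative, not taken from the release notes):

```python
from pathlib import Path
from datasets import Dataset, Image

# Path objects can now be passed directly where a file path is expected.
ds = Dataset.from_dict({"image": [Path("images/cat.png")]})
ds = ds.cast_column("image", Image())
```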
🐛 Bug Fixes
- Improved performance for `.zstd` compression, with conversion to string when needed.
- Fixed audio cast storage for inputs provided as `array` and `sampling_rate`.
- Corrected misleading `add_column()` usage example in docstring.
- Allowed dataset row indexing with NumPy integer types (see the sketch after this list).
- Updated `fsspec` maximum version constraint to 2025.7.0.
- Updated `push_to_hub` logic for `DatasetDict`.
- Added retry mechanism for intermediate commits during uploads.
- Clarified documentation for `num_proc`: `num_proc=0` now behaves like `None`, and `num_proc=1` uses one worker (not the main process).
- Fixed CI test related to `num_proc=1`.
- Updated PNG/JPEG depth map documentation to use `Image(mode="F")`.
- Fixed the string representation of `LargeList` objects.
- Corrected grammar in `fingerprint.py` ("showed" to "shown").
- Fixed type hint for `train_test_split`.
- Prevented `.lower()` operation on `field_name` in WebDataset handling.
- Refactored HDF5 loading to preserve tree structure.
- Added column overwrite example to the batch mapping guide.
- Fixed typo in the error message for cache directory deletion.
- Added support for pyarrow string view in features.
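A small example of the NumPy-index fix in use (the toy data is illustrative):

```python
import numpy as np
from datasets import Dataset

ds = Dataset.from_dict({"value": [10, 20, 30]})

# Row indexing now accepts NumPy integer types, e.g. indices returned
# by np.argmax or np.argsort.
idx = np.int64(2)
print(ds[idx])  # {'value': 30}
```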
🔧 Affected Symbols
- `ds.push_to_hub`
- `load_dataset`
- `add_column()`
- `train_test_split`
- `fingerprint.py`