Change8

Datasets

Data & ML

🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

Latest: 4.4.2 · 16 releases · 2 breaking changes

Release History

4.4.2 · 8 fixes · 4 features
Dec 19, 2025

This release is mostly bug fixes, particularly around NIfTI handling and fingerprinting, plus minor additions such as improved type inference for `load_dataset` and support for `inspect_ai` eval logs.

4.4.1 · 2 fixes
Nov 5, 2025

This patch release focuses on improving streaming reliability by enhancing retry logic for specific HTTP errors and cleaning up documentation for PDF and video features.
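
The idea of retrying only on specific HTTP errors can be pictured with a small standalone sketch. It is not the library's implementation: `fetch_with_retries`, `TransientHTTPError`, the set of retryable status codes, and the backoff parameters are all hypothetical names and values chosen for illustration.

```python
import time


class TransientHTTPError(Exception):
    """Hypothetical error that carries an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


# Assumed set of status codes worth retrying; the real list may differ.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def fetch_with_retries(fetch, max_retries=3, base_delay=0.01, sleep=time.sleep):
    """Call fetch() and retry with exponential backoff on retryable errors."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except TransientHTTPError as err:
            # Give up on non-retryable statuses or when attempts run out.
            if err.status not in RETRYABLE_STATUSES or attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


def make_flaky_fetch(failures, status=503):
    """Build a fetch function that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def fetch():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientHTTPError(status)
        return f"ok after {state['calls']} calls"

    return fetch
```

Passing `sleep=lambda s: None` makes the helper easy to exercise without waiting, for example `fetch_with_retries(make_flaky_fetch(2), sleep=lambda s: None)`.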

4.4.0 · 7 fixes · 3 features
Nov 4, 2025

This release introduces native support for loading NIfTI medical imaging datasets and allows explicit control over audio channel selection during casting. Several bug fixes address issues related to shuffling, array copying, and dependency compatibility.

4.3.0 · 4 fixes · 6 features
Oct 23, 2025

This release improves large-scale distributed dataset streaming, including better cache handling and file retries, alongside various bug fixes and feature enhancements.

4.2.0 · 1 fix · 4 features
Oct 9, 2025

This release introduces significant enhancements for Parquet dataset handling, including better error management for bad files and advanced scanning options for efficiency. It also adds a new sampling strategy for dataset interleaving.
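
The summary does not say which sampling strategy was added. As one possible reading, the sketch below assumes a strategy that samples from the sources at random but never revisits an exhausted source, so every example is yielded exactly once. The function `interleave_without_replacement` and its parameters are hypothetical and are not the library's API.

```python
import random


def interleave_without_replacement(sources, probabilities, seed=0):
    """Yield every item from every source exactly once, in sampled order."""
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    weights = list(probabilities)
    active = list(range(len(iterators)))
    while active:
        # Pick a source among those that still have items left.
        idx = rng.choices(active, weights=[weights[i] for i in active])[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            active.remove(idx)  # exhausted: stop sampling from this source


mixed = list(
    interleave_without_replacement([["a1", "a2", "a3"], ["b1"]], [0.5, 0.5])
)
```

Each source keeps its internal order, and the shorter source simply drops out of the draw once it runs dry instead of being restarted.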

4.1.1 · 2 fixes · 1 feature
Sep 18, 2025

This patch release adds support for Arrow iterables during concatenation and interleaving, and fixes bugs in nested field iteration and in Parquet export of empty datasets.

4.1.0 · 18 fixes · 6 features
Sep 15, 2025

This release introduces significant performance improvements via content-defined chunking for Parquet files and adds native support for loading HDF5 datasets. It also brings concurrent uploads and various bug fixes across audio handling and dataset processing.
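
Why content-defined chunking helps can be seen in a toy byte-level sketch. This is not how the Parquet writer implements it: the window-sum boundary rule, the `window` and `mask` values, and the function names are assumptions for illustration, and practical schemes also enforce minimum and maximum chunk sizes. Because a cut depends only on the last few bytes, inserting data near the start leaves later chunks byte-identical, so they can be deduplicated on upload; fixed-offset chunks all shift.

```python
import random


def cdc_chunks(data, window=16, mask=0x3F):
    """Split bytes where the sum of the last `window` bytes matches a pattern."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        # Cut after byte i once the window is full and the pattern matches.
        if i + 1 >= window and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


def fixed_chunks(data, size=64):
    """Split bytes at fixed offsets, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"new header" + original  # insert a few bytes at the front

cdc_before, cdc_after = cdc_chunks(original), cdc_chunks(edited)
cdc_after_set = set(cdc_after)
reused = sum(chunk in cdc_after_set for chunk in cdc_before)
print(f"content-defined: {reused}/{len(cdc_before)} chunks reused")

fixed_before, fixed_after = fixed_chunks(original), fixed_chunks(edited)
fixed_after_set = set(fixed_after)
reused_fixed = sum(chunk in fixed_after_set for chunk in fixed_before)
print(f"fixed-size: {reused_fixed}/{len(fixed_before)} chunks reused")
```

In this sketch every content-defined chunk after the first one in `original` reappears unchanged in `edited`, which is the property that makes re-uploads of slightly modified files cheap.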

4.0.0 · Breaking · 22 fixes · 4 features
Jul 9, 2025

This release introduces significant new features like `push_to_hub` for streaming datasets and lazy column access via the new `Column` object. It also mandates a migration away from the legacy `Sequence` type to the new `List` type and updates audio/video decoding backends to use `torchcodec`.

3.6.0 · 2 fixes · 1 feature
May 7, 2025

This release introduces Xet storage support for faster hub operations and includes several bug fixes, notably resolving an issue with Image Features handling Spark DataFrame bytearrays.

3.5.1 · 3 fixes · 1 feature
Apr 28, 2025

This release focuses on bug fixes and adds support for pyarrow 20, along with several minor improvements and dependency updates.

3.5.0 · 3 fixes · 2 features
Mar 27, 2025

This release introduces native PDF support when loading datasets, allowing users to directly process PDF files. It also includes several minor fixes related to local loading and file handling.

3.4.1 · 1 fix
Mar 17, 2025

This patch release primarily fixes a bug in `data_files` filtering.

3.4.0 · Breaking · 4 fixes · 4 features
Mar 14, 2025

This release introduces significant performance improvements for folder-based dataset building, including Parquet support and faster streaming via multithreading in `IterableDataset.decode`. A major breaking change involves replacing `decord` with `torchvision` for video loading.

3.3.2 · 2 fixes
Feb 20, 2025

This patch release focuses on stability, fixing a multiprocessing hang and improving async task cancellation. It also includes minor documentation and typo corrections.

3.3.1 · 1 fix
Feb 17, 2025

This patch release primarily addresses a performance regression related to filtering operations.

3.3.0 · 7 fixes · 3 features
Feb 14, 2025

This release introduces significant performance improvements for `IterableDataset`, including support for async map operations and optimized processing using pandas/polars formats. It also adds a new `repeat` method for datasets.
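
The benefit of an async map is that I/O-bound transforms can overlap instead of running one after another. The standalone sketch below illustrates that idea with `asyncio` only. It does not use or reproduce the library's API: `async_map`, `add_length`, and `max_concurrency` are hypothetical names, and unlike a streaming dataset this version collects all results in memory.

```python
import asyncio


async def async_map(func, items, max_concurrency=4):
    """Apply an async function to items, keeping input order in the output."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:  # cap the number of in-flight calls
            return await func(item)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(run_one(item) for item in items))


async def add_length(example):
    """Hypothetical per-example transform that awaits some I/O."""
    await asyncio.sleep(0.01)  # stand-in for a network call
    return {**example, "length": len(example["text"])}


examples = [{"text": "a"}, {"text": "bb"}, {"text": "ccc"}]
results = asyncio.run(async_map(add_length, examples))
```

With three examples and a concurrency cap of four, all three simulated calls wait at the same time, so the total delay is roughly one sleep rather than three.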