Datasets
Data & ML
🤗 The largest hub of ready-to-use datasets for AI models, with fast, easy-to-use and efficient data manipulation tools
Release History
4.8.4 (1 fix, 1 feature): This release introduces support for the latest torchvision version and resolves a regression in loading single-object JSON files.
4.8.3 (2 fixes): This patch release addresses minor bugs, including fixes for split_dataset_by_node and the Json.cast_storage docstring.
4.8.2 (1 feature): This minor release adds support for the Json type when handling empty structs.
4.8.1 (1 fix): This patch release fixes a double-yield issue in the formatted Arrow iterator.
4.8.0 (5 fixes, 4 features): This release introduces native support for reading and writing datasets to Hugging Face Storage Buckets and brings significant improvements and fixes to dataset streaming iterables. It also resolves a macOS segfault during multiprocessing pushes.
4.7.0 (8 fixes, 3 features): This release introduces the `Json()` type to robustly handle mixed-type data structures, such as those found in tool-calling datasets, and includes numerous bug fixes across iterable datasets and data processing pipelines.
4.6.1 (1 fix): This patch release addresses a bug in temporary file cleanup during hub push operations.
4.6.0 (11 fixes, 5 features): This release introduces major features for multimodal data handling, including native support for Image, Video, and Audio types in Lance datasets and enhanced deduplication via Xet storage during hub uploads. It also drops support for Python 3.9 and adds the ability to reshard IterableDatasets.
4.5.0 (3 fixes, 1 feature): This release introduces native support for the Lance dataset format and includes several bug fixes, notably improved error handling for invalid revisions in `load_dataset`.
4.4.2 (8 fixes, 4 features): This release focuses on bug fixes, particularly around NIfTI handling and fingerprinting, alongside minor additions such as improved type inference for load_dataset and support for inspect_ai eval logs.
4.4.1 (2 fixes): This patch release improves streaming reliability by enhancing retry logic for specific HTTP errors and cleans up documentation for PDF and video features.
4.4.0 (7 fixes, 3 features): This release introduces native support for loading NIfTI medical imaging datasets and allows explicit control over audio channel selection during casting. Several bug fixes address shuffling, array copying, and dependency compatibility.
4.3.0 (4 fixes, 6 features): This release introduces significant improvements for large-scale distributed dataset streaming, including better cache handling and file retries, alongside various bug fixes and feature enhancements.
4.2.0 (1 fix, 4 features): This release introduces significant enhancements for Parquet dataset handling, including better error management for bad files and advanced scanning options for efficiency. It also adds a new sampling strategy for dataset interleaving.
4.1.1 (2 fixes, 1 feature): This patch release adds support for Arrow iterables during concatenation and interleaving, and resolves bugs in nested-field iteration and empty-dataset Parquet export.
4.1.0 (18 fixes, 6 features): This release introduces significant performance improvements via content-defined chunking for Parquet files and adds native support for loading HDF5 datasets. It also brings concurrent upload capabilities and various bug fixes across audio handling and dataset processing.
4.0.0 (breaking changes; 22 fixes, 4 features): This release introduces significant new features such as `push_to_hub` for streaming datasets and lazy column access via the new `Column` object. It also mandates migration from the legacy `Sequence` type to the new `List` type and switches audio/video decoding to `torchcodec`.
3.6.0 (2 fixes, 1 feature): This release introduces Xet storage support for faster hub operations and includes several bug fixes, notably resolving an issue with Image features handling Spark DataFrame bytearrays.
3.5.1 (3 fixes, 1 feature): This release focuses on bug fixes, notably support for pyarrow 20, and includes several minor improvements and dependency updates.
3.5.0 (3 fixes, 2 features): This release introduces native PDF support when loading datasets, allowing users to process PDF files directly. It also includes several minor fixes related to local loading and file handling.
3.4.1 (1 fix): This patch release primarily addresses a bug in data_files filtering.
3.4.0 (breaking changes; 4 fixes, 4 features): This release introduces significant performance improvements for folder-based dataset building, including Parquet support and faster streaming via multithreading in `IterableDataset.decode`. A major breaking change replaces `decord` with `torchvision` for video loading.
3.3.2 (2 fixes): This patch release focuses on stability, fixing a multiprocessing hang and improving async task cancellation. It also includes minor documentation and typo corrections.
3.3.1 (1 fix): This patch release addresses a performance regression in filtering operations.
3.3.0 (7 fixes, 3 features): This release introduces significant performance improvements for IterableDatasets, including support for async map operations and optimized processing using pandas/polars formats. It also adds a new repeat method for datasets.
Common Errors
FileNotFoundError (8 reports): FileNotFoundError in datasets usually arises from incorrect file paths or missing files during dataset loading or processing. Ensure the file paths passed to your dataset loading code are accurate and that all required files exist at those locations. Double-check for typos and for relative vs. absolute path issues, and verify that the files are accessible with the correct permissions.
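One way to rule out path problems before they surface inside the loader is to validate the file list up front. The helper below is a hypothetical sketch using only the standard library; the paths are examples:

```python
from pathlib import Path

def check_data_files(paths):
    """Return the subset of paths that do not exist on disk (hypothetical helper)."""
    return [p for p in paths if not Path(p).expanduser().exists()]

# Validate before handing the list to e.g. load_dataset("csv", data_files=paths)
missing = check_data_files(["data/train.csv", "data/test.csv"])  # example paths
if missing:
    print(f"Missing data files: {missing}")
```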
ArrowNotImplementedError (5 reports): ArrowNotImplementedError in datasets often arises when the underlying Arrow library lacks support for casting between specific data types, especially for complex nested structures or multimedia data. To fix this, either upgrade to the latest versions of `pyarrow` and `datasets`, which may contain the necessary casting implementations, or restructure your dataset/features to use simpler Arrow-compatible datatypes, potentially involving manual serialization/deserialization where direct casting isn't available.
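When a direct Arrow cast is unavailable, the manual serialization route mentioned above can be as simple as round-tripping the offending column through JSON strings. A sketch with hypothetical column names:

```python
import json

# Rows whose "payload" column mixes dicts and lists, which Arrow cannot
# represent as a single concrete type without an explicit cast.
rows = [
    {"id": 0, "payload": {"tool": "search", "args": ["query"]}},
    {"id": 1, "payload": [1, 2, 3]},
]

# Store the mixed column as plain strings so the Arrow column type is just string.
serialized = [{**r, "payload": json.dumps(r["payload"])} for r in rows]

# Deserialize after loading to recover the original structures.
restored = [{**r, "payload": json.loads(r["payload"])} for r in serialized]
```

This trades schema expressiveness for compatibility: the column becomes an ordinary string column that every Arrow version can handle.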
NotImplementedError (5 reports): The "NotImplementedError" in datasets often arises when a requested feature or function, like audio decoding or streaming, isn't fully implemented for a specific dataset format or within the current environment. To fix it, either install any missing dependencies required for the unimplemented functionality (e.g., `pip install librosa` for audio decoding) or switch to a dataset format or loading method that fully supports the desired feature in your environment. If using streaming, ensure your environment supports it fully, potentially requiring upgrades or specific configurations.
DatasetGenerationError (4 reports): DatasetGenerationError often arises from unsupported data types or nested structures when creating or converting datasets, especially with chunked arrays or specific file formats like webdataset. Ensure that your data types are compatible with the dataset format, and flatten or convert complex nested structures, such as deeply nested dictionaries or lists of dictionaries, to simpler serializable types like strings or numpy arrays before dataset creation. Consider using `datasets.features.Features` to explicitly define the expected schema and data types, enabling automatic type coercion during dataset creation.
UnicodeDecodeError (3 reports): UnicodeDecodeError often arises when reading files with an encoding different from the expected one (usually UTF-8). Specify the correct encoding when opening the file using the `encoding` parameter (e.g., `encoding='latin-1'` or `encoding='utf-8'`) in the dataset loading function or file reading operation. If the file contains broken or invalid characters, consider using `errors='ignore'` or `errors='replace'` to skip or replace problematic characters during decoding.
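The effect of the `encoding` and `errors` parameters can be seen with plain Python string handling; a sketch assuming the source bytes were written as Latin-1:

```python
raw = "café".encode("latin-1")  # bytes written with a non-UTF-8 encoding

# Decoding with the default UTF-8 codec fails on the 0xE9 byte:
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Fix 1: decode with the correct encoding.
text = raw.decode("latin-1")

# Fix 2: replace undecodable bytes when the true encoding is unknown.
lossy = raw.decode("utf-8", errors="replace")
```

The same choice applies when loading: prefer naming the correct codec, and fall back to `errors='replace'` only when the data itself is genuinely corrupt.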
EntryNotFoundError (2 reports): The "EntryNotFoundError" in datasets usually means the specified dataset script (like superb.py or common_voice_11_0.py) is either missing or not correctly named/located within the dataset directory. Verify the dataset script exists in the expected location, double-check its filename for any typos, and ensure it's a correctly formatted Python script; if the dataset is deprecated, consider using a more updated version or alternative data source.
Related Data & ML Packages
TensorFlow: An Open Source Machine Learning Framework for Everyone
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
scikit-learn: machine learning in Python
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Streamlit — A faster way to build and share data apps.