Changelog

4.2.0

📦 datasets
✨ 4 features · 🐛 1 fix · 🔧 4 symbols

Summary

This release improves Parquet dataset handling: configurable handling of corrupted files, column selection and row filtering for more efficient scans, and fine-grained control over buffering and caching when streaming. It also adds a sampling-without-replacement strategy for dataset interleaving.

Migration Steps

  1. If you encounter issues with bad Parquet files during loading, consider using the new `on_bad_files` argument in `load_dataset`.
  2. For optimized Parquet loading, use the new `columns` and `filters` arguments in `load_dataset`.
  3. When streaming Parquet datasets, you can now fine-tune caching and buffering using the `fragment_scan_options` argument.

✨ New Features

  • Added option to sample without replacement when interleaving datasets using `stopping_strategy="all_exhausted_without_replacement"` in `interleave_datasets`.
  • Added `on_bad_files` argument to `load_dataset` when loading Parquet datasets to handle corrupted files by erroring, warning, or skipping them.
  • Added Parquet scan options to `load_dataset` allowing selection of specific columns (`columns`) and filtering data (`filters`) efficiently.
  • Added support for controlling buffering and caching during streaming of Parquet datasets via the new `fragment_scan_options` argument, accepting `pyarrow.dataset.ParquetFragmentScanOptions`.

🐛 Bug Fixes

  • Avoided some unnecessary copies in the PyTorch formatter.

🔧 Affected Symbols

  • `interleave_datasets`
  • `load_dataset`
  • `pyarrow.dataset.ParquetFragmentScanOptions`
  • `pyarrow.CacheOptions`