Changelog

4.2.0

📦 datasets
✨ 4 features · 🐛 1 fix · 🔧 4 symbols

Summary

This release improves Parquet dataset handling: configurable handling of corrupted files, column selection and row filtering for more efficient scans, and fine-grained control over buffering and caching when streaming. It also adds a sampling-without-replacement strategy for dataset interleaving.

Migration Steps

  1. If you encounter issues with bad Parquet files during loading, consider using the new `on_bad_files` argument in `load_dataset`.
  2. For optimized Parquet loading, use the new `columns` and `filters` arguments in `load_dataset`.
  3. When streaming Parquet datasets, you can now fine-tune caching and buffering using the `fragment_scan_options` argument.

✨ New Features

  • Added option to sample without replacement when interleaving datasets using `stopping_strategy="all_exhausted_without_replacement"` in `interleave_datasets`.
  • Added `on_bad_files` argument to `load_dataset` when loading Parquet datasets to handle corrupted files by erroring, warning, or skipping them.
  • Added Parquet scan options to `load_dataset` allowing selection of specific columns (`columns`) and filtering data (`filters`) efficiently.
  • Added support for controlling buffering and caching during streaming of Parquet datasets via the new `fragment_scan_options` argument, accepting `pyarrow.dataset.ParquetFragmentScanOptions`.

🐛 Bug Fixes

  • Avoided some unnecessary copies in the PyTorch formatter.

🔧 Affected Symbols

  • `interleave_datasets`
  • `load_dataset`
  • `pyarrow.dataset.ParquetFragmentScanOptions`
  • `pyarrow.CacheOptions`