4.2.0
📦 datasets
✨ 4 features · 🐛 1 fix · 🔧 4 symbols
Summary
This release improves Parquet dataset handling: corrupted files can now be skipped or reported instead of aborting a load, and new scan options push column selection and row filtering down to the Parquet reader. It also adds a without-replacement stopping strategy for interleaving datasets.
Migration Steps
- If bad Parquet files break your loads, use the new `on_bad_files` argument of `load_dataset` to error, warn, or skip on corrupted files (see the first sketch after this list).
- To load only the data you need, pass the new `columns` and `filters` arguments to `load_dataset` so selection and filtering happen at the Parquet scan (also shown in the first sketch below).
- When streaming Parquet datasets, fine-tune buffering and caching with the new `fragment_scan_options` argument (see the second sketch below).
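A minimal sketch of the first two steps, assuming local Parquet files under a hypothetical `data/` path with made-up `id`, `text`, and `label` columns; the `"skip"` value follows the error/warn/skip behavior described in the features below, and `filters` uses pyarrow's DNF tuple convention (check the `load_dataset` reference for the exact accepted values):

```python
from datasets import load_dataset

# Skip corrupted Parquet files instead of failing the whole load.
ds = load_dataset(
    "parquet",
    data_files="data/*.parquet",  # hypothetical path
    on_bad_files="skip",          # or error/warn per the release notes
)

# Read only two columns and push a row filter down to the Parquet scan,
# instead of materializing every column and row first.
ds = load_dataset(
    "parquet",
    data_files="data/*.parquet",
    columns=["id", "text"],        # hypothetical column names
    filters=[("label", "==", 1)],  # pyarrow DNF-style filter
)
```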
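And a sketch of the streaming step, passing a `pyarrow.dataset.ParquetFragmentScanOptions` with a `pyarrow.CacheOptions`; the concrete limits are arbitrary examples, not recommended values:

```python
import pyarrow as pa
import pyarrow.dataset as pads
from datasets import load_dataset

# Pre-buffer column chunks and coalesce nearby byte ranges while streaming.
scan_options = pads.ParquetFragmentScanOptions(
    pre_buffer=True,
    cache_options=pa.CacheOptions(
        hole_size_limit=4 << 20,    # merge ranges whose gap is under 4 MiB
        range_size_limit=32 << 20,  # cap a coalesced range at 32 MiB
    ),
)

ds = load_dataset(
    "parquet",
    data_files="data/*.parquet",  # hypothetical path
    streaming=True,
    fragment_scan_options=scan_options,
)

for example in ds["train"].take(3):
    print(example)
```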
✨ New Features
- Added the option to sample without replacement when interleaving datasets, via `stopping_strategy="all_exhausted_without_replacement"` in `interleave_datasets` (see the example after this list).
- Added `on_bad_files` argument to `load_dataset` when loading Parquet datasets to handle corrupted files by erroring, warning, or skipping them.
- Added Parquet scan options to `load_dataset` for efficiently selecting specific columns (`columns`) and filtering rows (`filters`).
- Added support for controlling buffering and caching during streaming of Parquet datasets via the new `fragment_scan_options` argument, accepting `pyarrow.dataset.ParquetFragmentScanOptions`.
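A minimal sketch of the new stopping strategy with two toy in-memory datasets (names and contents are illustrative):

```python
from datasets import Dataset, interleave_datasets

# Two toy datasets of different lengths.
ds_a = Dataset.from_dict({"id": [0, 1, 2]})
ds_b = Dataset.from_dict({"id": [10, 11]})

# Keep alternating until every dataset is exhausted, but never re-sample
# examples from datasets that have already run out.
mixed = interleave_datasets(
    [ds_a, ds_b],
    stopping_strategy="all_exhausted_without_replacement",
)
print(mixed["id"])
```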
🐛 Bug Fixes
- Avoided some unnecessary copies in the PyTorch formatter.
🔧 Affected Symbols
- `interleave_datasets`
- `load_dataset`
- `pyarrow.dataset.ParquetFragmentScanOptions`
- `pyarrow.CacheOptions`