3.3.0
📦 datasets
✨ 3 features · 🐛 7 fixes · 🔧 4 symbols
Summary
This release adds support for async functions in map(), speeds up IterableDataset.map() through new "pandas" and "polars" formats, and adds a repeat method for datasets.
Migration Steps
- If you call IterableDataset.map() with batched=True and your function works on pandas or polars data, consider calling .with_format("pandas") or .with_format("polars") before map() for potential speedups.
✨ New Features
- Support async functions in map() for datasets, useful for downloading content or calling inference APIs.
- Add repeat method to datasets (e.g., ds.repeat(10)).
- Support "pandas" and "polars" formats in IterableDataset, so IterableDataset.map() can run pandas or polars functions with zero-copy optimized processing.
🐛 Bug Fixes
- Don't import soundfile in tests.
- Fix typo in arrow_dataset.
- Remove filecheck to enable symlinks.
- Handle Webdataset special columns in the last position.
- Catch OSError for arrow operations.
- Remove .h5 from imagefolder extensions.
- Optimize sequence encoding for scalars.
🔧 Affected Symbols
- IterableDataset.map
- ds.map
- ds.repeat
- ds.with_format