3.3.0

📅 Feb 14, 2025 📦 datasets

✨ 3 features 🐛 7 fixes 🔧 4 symbols

Summary

This release speeds up processing for IterableDataset: map() now accepts async functions, and IterableDataset gains "pandas" and "polars" formats for optimized batched processing. It also adds a new repeat method for datasets.

Migration Steps

If you call IterableDataset.map() with batched=True and your function works on pandas or polars data, consider calling .with_format("pandas") or .with_format("polars") before map() for a potential speedup.

✨ New Features

Support async functions in map() for datasets, useful for downloading content or calling inference APIs.
Add repeat method to datasets (e.g., ds.repeat(10)).
Add "pandas" and "polars" formats to IterableDataset, so IterableDataset.map() can apply pandas or polars functions directly with zero-copy optimized processing.

🐛 Bug Fixes

Don't import soundfile in tests.
Fix typo in arrow_dataset.
Remove filecheck to enable symlinks.
Handle Webdataset special columns in the last position.
Catch OSError for arrow operations.
Remove .h5 from imagefolder extensions.
Optimize sequence encoding for scalars.

🔧 Affected Symbols

IterableDataset.map, ds.map, ds.repeat, ds.with_format