3.3.0

📦 datasets
3 features · 🐛 7 fixes · 🔧 4 symbols

Summary

This release adds support for async functions in map() and speeds up IterableDataset processing through new "pandas" and "polars" formats. It also adds a repeat method for datasets.

Migration Steps

  1. If you call IterableDataset.map() with batched=True and your function operates on pandas or polars data, consider calling .with_format("pandas") or .with_format("polars") first for a potential speedup.
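
The migration step above could look roughly like the sketch below. It is illustrative only: the toy data, the column name "text", and the helper names add_text_length and build_stream are hypothetical, and it assumes datasets >= 3.3.0 with pandas installed.

```python
import pandas as pd


def add_text_length(df: pd.DataFrame) -> pd.DataFrame:
    """Batch transform: takes a pandas DataFrame, returns it with an extra column."""
    # Vectorized string length instead of a per-row Python loop.
    return df.assign(n_chars=df["text"].str.len())


def build_stream():
    """Wire the transform into an IterableDataset (assumes datasets >= 3.3.0)."""
    from datasets import Dataset

    # Hypothetical toy data; to_iterable_dataset() converts it to an IterableDataset.
    stream = Dataset.from_dict({"text": ["a", "bb", "ccc", "dddd"]}).to_iterable_dataset()
    # With the "pandas" format set, batched map() hands each batch to the
    # function as a DataFrame.
    stream = stream.with_format("pandas")
    return stream.map(add_text_length, batched=True, batch_size=2)
```

Swapping "pandas" for "polars" and rewriting the transform with polars expressions should follow the same pattern. Whether it is actually faster depends on the workload, so it is worth measuring on your own data.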

✨ New Features

  • Support async functions in map() for datasets, useful for downloading content or calling inference APIs.
  • Add repeat method to datasets (e.g., ds.repeat(10)).
  • Support "pandas" and "polars" formats in IterableDataset, so that IterableDataset.map() can apply pandas or polars functions with zero-copy, optimized processing.
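
As a minimal sketch of the first two features, the example below passes an async function to map() and then calls repeat(). The function add_length, the helper demo, and the toy data are hypothetical, and asyncio.sleep only stands in for a real download or inference call; the sketch assumes datasets >= 3.3.0.

```python
import asyncio


async def add_length(example: dict) -> dict:
    """Async per-example function; the sleep is a placeholder for real I/O."""
    await asyncio.sleep(0.01)  # stand-in for a download or API request
    return {"n_chars": len(example["text"])}


def demo():
    """Apply the async function with map(), then repeat the dataset (datasets >= 3.3.0)."""
    from datasets import Dataset

    # Hypothetical toy data.
    ds = Dataset.from_dict({"text": ["a", "bb", "ccc"]})
    # map() accepts the coroutine function directly; no manual event loop is needed here.
    ds = ds.map(add_length)
    # repeat(2) builds a dataset that goes through the rows twice.
    return ds.repeat(2)
```

For real network calls it is usually wise to cap concurrency, for example with an asyncio.Semaphore inside the async function, so that the remote service is not flooded with requests.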

🐛 Bug Fixes

  • Don't import soundfile in tests.
  • Fix typo in arrow_dataset.
  • Remove filecheck to enable symlinks.
  • Handle Webdataset special columns in the last position.
  • Catch OSError for arrow operations.
  • Remove .h5 from imagefolder extensions.
  • Optimize sequence encoding for scalars.

🔧 Affected Symbols

IterableDataset.map, ds.map, ds.repeat, ds.with_format