4.6.0
📦 datasets
✨ 5 features · 🐛 11 fixes · ⚡ 1 deprecation · 🔧 5 symbols
Summary
This release introduces major features for multimodal data handling, including native support for Image, Video, and Audio types in Lance datasets and enhanced deduplication capabilities via Xet storage during hub uploads. It also drops support for Python 3.9 and adds the ability to reshard IterableDatasets.
Migration Steps
- If you rely on Python 3.9, you must upgrade to Python 3.10 or newer.
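A minimal sketch of a guard a project could run before upgrading, assuming only the minimum version stated above. The helper name `python_is_supported` and the constant `MIN_PYTHON` are hypothetical and are not part of datasets.

```python
import sys

# Minimum interpreter version stated in these release notes.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


# Report the status of the running interpreter.
current = ".".join(str(part) for part in sys.version_info[:3])
if python_is_supported():
    print(f"Python {current} is supported")
else:
    print(f"Python {current} is too old; upgrade to 3.10 or newer")
```

A check like this could run in CI or at the top of a setup script so that an unsupported interpreter is caught before the upgrade is attempted.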
✨ New Features
- Support for Image, Video, and Audio types in Lance datasets, including type inference from Lance blobs.
- push_to_hub() now supports Video types.
- Image/audio/video blobs are written as-is (PLAIN) in Parquet during push_to_hub() to enable cross-format Xet deduplication and faster uploads/downloads.
- Added IterableDataset.reshard() to split existing shards further, which works by sharding per row group for Parquet datasets.
- Added support for polars.LazyFrame in IterableDataset.from_x methods.
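The idea behind resharding per row group can be illustrated in plain Python. This is a conceptual sketch only, not the library's implementation: the `Shard` class, the `reshard_per_row_group` helper, and the file names are hypothetical, and the real IterableDataset.reshard() may differ in its details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Hypothetical shard: a Parquet file plus the row groups it reads."""
    path: str
    row_groups: tuple[int, ...]


def reshard_per_row_group(shards):
    """Split every shard into one shard per row group."""
    return [
        Shard(shard.path, (row_group,))
        for shard in shards
        for row_group in shard.row_groups
    ]


# Two file-level shards with 3 and 2 row groups respectively.
file_shards = [
    Shard("data-00000.parquet", (0, 1, 2)),
    Shard("data-00001.parquet", (0, 1)),
]

row_group_shards = reshard_per_row_group(file_shards)
print(len(file_shards), "->", len(row_group_shards))  # 2 -> 5
```

Under this assumption, a dataset stored in a few large files can still be divided among more parallel readers than there are files, because each row group becomes its own unit of work.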
🐛 Bug Fixes
- Fixed load_from_disk progress bar when stdout is redirected.
- Reverted a change that avoided some copies in the torch formatter.
- Fixed interleave_datasets behavior with the all_exhausted_without_replacement strategy.
- Fixed handling of null values in JSON string columns.
- Fixed handling of blob data in Lance format.
- Fixed example counting in Lance datasets.
- Used temporary files in push_to_hub to save memory.
- Bumped fsspec upper bound to 2026.2.0 (fixes issue #7994).
- Made environment variable naming consistent (fixes issue #7998).
- Fixed support for empty shards in from_generator.
- Allowed importing polars within map() operations.
Affected Symbols
⚡ Deprecations
- Python 3.9 support has been dropped.