4.7.0
📦 datasets
✨ 3 features · 🐛 8 fixes · 🔧 10 symbols
Summary
This release introduces the `Json()` type to robustly handle mixed-type data structures, such as those found in tool calling datasets, and includes numerous bug fixes across iterable datasets and data processing pipelines.
Migration Steps
- When a field or subfield contains mixed types (e.g., a mix of str/int/float/dict/list, or dictionaries with arbitrary keys), use `Json()` in your `Features` definition, or pass `on_mixed_types="use_json"` during dataset loading, mapping, or creation to prevent Arrow conversion errors.
✨ New Features
- Added support for JSON Lines files containing arbitrary JSON objects (like tool calling datasets) via the new `Json()` type in Features.
- The `Json()` type can now be used in `Features()` for any dataset and is supported in functions like `load_dataset()`, `.map()`, `.cast()`, `.from_dict()`, and `.from_list()`.
- Introduced `on_mixed_types="use_json"` argument to automatically set the `Json()` type on mixed types during `.from_dict()`, `.from_list()`, and `.map()` operations.
🐛 Bug Fixes
- Fixed typos in iterable_dataset.py.
- Fixed non-deterministic behavior by sorting metadata extensions.
- Used `num_examples` instead of `len(self)` for an `IterableDataset`'s `SplitInfo`.
- Fixed silent data loss during `push_to_hub` when `num_proc > num_shards`.
- Fixed issue where bad files were extracted.
- Preserved features when chaining `filter()` on a typed IterableDataset.
- Fixed handling of nested null types in feature alignment for multi-process map operations.
- Fixed unstable tokenizer fingerprinting, enabling map cache reuse.