4.7.0

📦 datasets
✨ 3 features · 🐛 8 fixes · 🔧 10 symbols

Summary

This release introduces the `Json()` type to robustly handle mixed-type data structures, such as those found in tool calling datasets, and includes numerous bug fixes across iterable datasets and data processing pipelines.

Migration Steps

  1. For fields or subfields that contain mixed types (e.g., a mix of str/int/float/dict/list values, or dictionaries with arbitrary keys), declare them as `Json()` in your `Features` definition, or pass `on_mixed_types="use_json"` when loading, mapping, or creating a dataset, to prevent Arrow conversion errors.
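To make the problem concrete, here is a stdlib-only sketch of the idea behind `Json()` (illustrative, not the library's implementation): encoding each mixed-type value as a JSON string gives the column a single uniform type that Arrow can store.

```python
import json

# Hypothetical tool-calling rows: the "arguments" field mixes dicts, lists,
# and strings, so it has no single fixed Arrow type on its own.
rows = [
    {"tool": "search", "arguments": {"query": "weather in Paris"}},
    {"tool": "calc", "arguments": [1, 2, 3]},
    {"tool": "echo", "arguments": "hello"},
]

# JSON-encoding every value yields one uniform type (str), which is
# essentially what declaring the field as Json() lets the library do for you.
encoded = [json.dumps(row["arguments"], sort_keys=True) for row in rows]
assert all(isinstance(v, str) for v in encoded)

# The values round-trip losslessly on decode.
decoded = [json.loads(v) for v in encoded]
assert decoded == [row["arguments"] for row in rows]
```

The trade-off is that a JSON-encoded field loses columnar typing (you cannot filter on its subfields directly), which is why the type is opt-in rather than the default.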

✨ New Features

  • Added support for JSON Lines files containing arbitrary JSON objects (like tool calling datasets) via the new `Json()` type in Features.
  • The `Json()` type can now be used in `Features()` for any dataset and is supported in functions like `load_dataset()`, `.map()`, `.cast()`, `.from_dict()`, and `.from_list()`.
  • Introduced `on_mixed_types="use_json"` argument to automatically set the `Json()` type on mixed types during `.from_dict()`, `.from_list()`, and `.map()` operations.
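A rough stdlib sketch of the detection idea behind `on_mixed_types="use_json"` (illustrative only; the library's actual logic lives in its Arrow typing code): scan each column for inconsistent Python types, and JSON-encode only the columns that mix types.

```python
import json

def detect_mixed_columns(rows):
    """Return the names of columns whose values do not share one Python type."""
    types_seen = {}
    for row in rows:
        for col, value in row.items():
            types_seen.setdefault(col, set()).add(type(value))
    return {col for col, types in types_seen.items() if len(types) > 1}

def encode_mixed_as_json(rows):
    """JSON-encode only the mixed-type columns; uniform columns are untouched."""
    mixed = detect_mixed_columns(rows)
    return [
        {col: json.dumps(v) if col in mixed else v for col, v in row.items()}
        for row in rows
    ]

rows = [
    {"id": 1, "payload": {"a": 1}},
    {"id": 2, "payload": "plain text"},
]
out = encode_mixed_as_json(rows)
# "payload" is detected as mixed and stored as JSON strings; "id" stays int.
```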

🐛 Bug Fixes

  • Fixed typos in `iterable_dataset.py`.
  • Fixed non-deterministic behavior by sorting metadata extensions.
  • Used `num_examples` instead of `len(self)` for an `IterableDataset`'s `SplitInfo`.
  • Fixed silent data loss during `push_to_hub` when `num_proc > num_shards`.
  • Fixed issue where bad files were extracted.
  • Preserved features when chaining `filter()` on a typed IterableDataset.
  • Fixed handling of nested null types in feature alignment for multi-process map operations.
  • Fixed unstable tokenizer fingerprinting, enabling map cache reuse.
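Regarding the `push_to_hub` fix: a common guard against this class of bug (a sketch of the general pattern only, not necessarily the library's actual change) is to clamp the worker count to the shard count, so no shard is left unassigned when more processes are requested than there are shards.

```python
def assign_shards(num_shards, num_proc):
    """Distribute shard indices across workers, clamping num_proc to
    num_shards so every shard is owned by exactly one worker."""
    num_proc = min(num_proc, num_shards)  # guard: more workers than shards
    assignments = [[] for _ in range(num_proc)]
    for shard in range(num_shards):
        assignments[shard % num_proc].append(shard)
    return assignments

# With 2 shards and 8 requested workers, only 2 workers get work,
# and both shards are covered exactly once (nothing silently dropped).
plan = assign_shards(num_shards=2, num_proc=8)
covered = sorted(s for worker in plan for s in worker)
```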

Affected Symbols