Change8

py-1.38.0

📦 polars
27 features · 🐛 40 fixes · 1 deprecation · 🔧 38 symbols

Summary

This release focuses heavily on performance improvements across streaming, I/O, and core computations, alongside numerous bug fixes for stability and correctness. A key change is the deprecation of the `retries` argument in favor of `storage_options={"max_retries": n}`.

Migration Steps

  1. Replace usage of `retries=n` with `storage_options={"max_retries": n}`.
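
The migration step above can be sketched as a small keyword-argument rewrite. This is only an illustration: `migrate_retries` is a hypothetical helper that is not part of Polars, and `example_key` is a placeholder standing in for whatever options a call already passes. The only names taken from the release notes are `retries`, `storage_options`, and `max_retries`.

```python
from typing import Any


def migrate_retries(kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `kwargs` with `retries=n` moved into `storage_options`.

    Hypothetical helper for illustration; not part of the Polars API.
    """
    migrated = dict(kwargs)
    if "retries" not in migrated:
        return migrated

    retries = migrated.pop("retries")
    # Copy so the caller's storage_options dict is not mutated.
    storage_options = dict(migrated.get("storage_options") or {})
    # An explicit max_retries already in storage_options takes precedence.
    storage_options.setdefault("max_retries", retries)
    migrated["storage_options"] = storage_options
    return migrated


# "example_key" is a placeholder, not a real storage option.
old_kwargs = {"retries": 3, "storage_options": {"example_key": "example_value"}}
new_kwargs = migrate_retries(old_kwargs)
```

The resulting dictionary can then be unpacked into a call that accepts `storage_options`, for example a cloud scan. Letting an existing `max_retries` win over the legacy argument is a design choice of this sketch, not documented Polars behavior.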

✨ New Features

  • Enable zero-copy object_store `put` upload for IPC sink.
  • Resolve file schemas and metadata concurrently.
  • Run elementwise CSEE for the streaming engine.
  • Disable morsel splitting for fast-count on streaming engine.
  • Implement streaming decompression for scan_ndjson and scan_lines.
  • Add dedicated kernel for group-by `arg_max/arg_min`.
  • Add streaming merge-join.
  • Generalize Bitmap::new_zeroed opt for Buffer::zeroed.
  • Avoid OOM for scan_ndjson and scan_lines if input is compressed and negative slice.
  • Support anonymous agg in-mem.
  • Add unstable `arrow_schema` parameter to `sink_parquet`.
  • Expose `upload_concurrency` through env var.
  • Allow quantile to compute multiple quantiles at once.
  • Allow empty LazyFrame in `LazyFrame.group_by(...).map_groups`.
  • Use delta file statistics for batch predicate pushdown.
  • Add streaming UnorderedUnion.
  • Implement compression support for sink_ndjson.
  • Add unstable record batch statistics flags to `{sink/scan}_ipc`.
  • Support CSE for python UDFs on the same address.
  • Cloud retry/backoff configuration via `storage_options`.
  • Add compression support to write_csv and sink_csv.
  • Add `scan_lines`.
  • Support regex in `str.split`.
  • Add unstable IPC Statistics read/write to `scan_ipc`/`sink_ipc`.
  • Add unstable `height` parameter to `DataFrame`/`LazyFrame`.
  • Expose ArrowStreamExportable on python collect batches iterator.
  • Add nulls support for all rolling_by operations.
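
To give a feel for what streaming decompression of line-oriented input means, here is a minimal sketch using only the Python standard library. It is not how Polars implements `scan_ndjson` or `scan_lines`; `iter_ndjson` is a hypothetical name, and gzip is assumed purely as an example codec. The point is that records are yielded while the compressed stream is still being read, so the fully decompressed payload never has to sit in memory at once.

```python
import gzip
import io
import json
from typing import BinaryIO, Iterator


def iter_ndjson(compressed: BinaryIO) -> Iterator[dict]:
    """Yield one parsed record per line, decompressing incrementally.

    Hypothetical illustration; Polars uses its own native readers.
    """
    # GzipFile decompresses in buffered chunks as lines are requested.
    with gzip.GzipFile(fileobj=compressed, mode="rb") as stream:
        for raw_line in stream:
            line = raw_line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


payload = gzip.compress(b'{"a": 1}\n{"a": 2}\n{"a": 3}\n')
records = list(iter_ndjson(io.BytesIO(payload)))
```

Collecting into a list here is only for demonstration; a consumer that processes records one at a time keeps memory use bounded by the chunk size rather than by the file size.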

🐛 Bug Fixes

  • Correct off-by-one in RLE row counting for nullable dictionary-encoded columns.
  • Support very large integers in env var limits.
  • Fix PlPath panic from incorrect slicing of UTF8 boundaries.
  • Fix Float dtype for spearman correlation.
  • Fix optimizer panic in right joins with type coercion.
  • Don't serialize retry config from local environment vars.
  • Fix `PartitionBy` with scalar key expressions and `diff()`.
  • Add {Float16, Float32} -> Float32 lossless upcast.
  • Fix panic using `with_columns` and `collect_all`.
  • Add multi-page support for writing dictionary-encoded Parquet columns.
  • Ensure slice advancement when skipping non-inlinable values in `is_in` with inlinable needles.
  • Fix bugs in ViewArray total_bytes_len.
  • Fix overflow in i128::abs in Decimal fits check.
  • Make Expr.hash on Categorical mapping-independent.
  • Clone shared GroupBy node before mutation in physical plan creation.
  • Fixed "sheet_name" typing for `read_ods` and `read_excel`.
  • Improve Polars dtype inference from Python `Union` typing.
  • Consider the "current location" of an item when computing `rolling_rank_by`.
  • Reset `is_count_star` flag between queries in collect_all.
  • Fix incorrect is_between filter on scan_parquet.
  • Make polars compatible with ty.
  • Lower AnonymousStreamingAgg in group-by as aggregate.
  • Avoid overflow in `pl.duration` scalar arguments case.
  • Broadcast arr.get on single array with multiple indices.
  • Fix panic on CSPE with sorts.
  • Fix eager `DataFrame.slice` with negative offset and `length=None`.
  • Use correct schema side for streaming merge join lowering.
  • Fix overflow panic in `scan_csv` with multiple files and `skip_rows + n_rows` larger than total row count.
  • Respect `allow_object` flag after cache.
  • Raise error on non-elementwise PartitionBy keys.
  • Allow ordered categorical dictionary in scan_parquet.
  • Allow excess bytes on IPC bitmap compressed length.
  • Fix deadlock on `hash_rows()` of 0-width DataFrame.
  • Fix NameError filtering pyarrow dataset.
  • Fix concat_arr panic when using categoricals/enums.
  • Fix NDJSON/scan_lines negative slice splitting with extremely long lines.
  • Fix incorrect group_by min/max fast path.
  • Remove a source of non-determinism from lowering.
  • Error when `with_row_index` or `unpivot` create duplicate columns on a `LazyFrame`.
  • Fix panics on shift with head.
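
The `i128::abs` entry above belongs to a general class of bug: the most negative two's-complement value has no positive counterpart, so taking its absolute value overflows. The release notes do not describe the actual fix, so the following is only a sketch of the problem and of one common way around it. Python integers do not overflow, so the 128-bit behavior is emulated by hand; `wrapping_abs_i128` and `fits_precision` are hypothetical names, and 38 is used as an example precision.

```python
I128_MIN = -(2**127)
I128_MAX = 2**127 - 1


def wrapping_abs_i128(value: int) -> int:
    """Emulate a wrapping absolute value on a 128-bit signed integer."""
    result = abs(value)
    # abs(I128_MIN) is 2**127, which does not fit and wraps back to I128_MIN.
    if result > I128_MAX:
        result -= 2**128
    return result


def naive_fits(value: int, precision: int) -> bool:
    """Buggy check: relies on abs, which wraps for I128_MIN."""
    return wrapping_abs_i128(value) < 10**precision


def fits_precision(value: int, precision: int) -> bool:
    """Check |value| < 10**precision by comparing against both bounds."""
    bound = 10**precision
    return -bound < value < bound


wrapped = wrapping_abs_i128(I128_MIN)
naive_result = naive_fits(I128_MIN, 38)
safe_result = fits_precision(I128_MIN, 38)
```

Here the naive check accepts `I128_MIN` because the wrapped result is negative, while the two-sided comparison correctly rejects it without ever computing an absolute value.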

Affected Symbols

⚡ Deprecations

  • Deprecate `retries=n` in favor of `storage_options={"max_retries": n}`.