Change8

py-1.32.0-beta.1

📦 polars
✨ 31 features · 🐛 78 fixes · 🔧 41 symbols

Summary

This release focuses heavily on performance improvements in the streaming engine, including lowering various operations into it and optimizing predicate pushdown. Key enhancements include making `Selector` a concrete part of the DSL and reworking Categorical/Enum handling to use (Frozen)Categories.
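
As a quick illustration of the selector side of this, a minimal sketch using the existing `polars.selectors` module (the example data is illustrative; the internal `Selector` DSL node itself is not shown):

```python
import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})

# Selectors describe sets of columns (here: all numeric ones); with Selector
# now a concrete DSL node, such selections live in the query plan itself.
print(df.select(cs.numeric()))
```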

Migration Steps

  1. Raise and warn on UDFs without `return_dtype` set (consider setting it explicitly; see the sketch below).
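
A minimal sketch of the suggested migration, assuming a UDF applied via `map_elements` (the column, lambda, and dtype are illustrative):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

# Declare the UDF's output dtype explicitly so the new raise/warn path
# for UDFs without `return_dtype` is never hit.
out = df.with_columns(
    pl.col("a").map_elements(lambda x: x * 2, return_dtype=pl.Int64).alias("doubled")
)
print(out)
```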

✨ New Features

  • Make `Selector` a concrete part of the DSL.
  • Rework Categorical/Enum to use (Frozen)Categories.
  • Add Python-side caching for credentials and provider auto-initialization.
  • Expand on `DataTypeExpr`.
  • Add scalar checks to range expressions.
  • Expose `POLARS_DOT_SVG_VIEWER` to automatically dispatch to SVG viewer.
  • Implement mean function in `arr` namespace.
  • Implement `vec_hash` for `List` and `Array`.
  • Add unstable `pl.row_index()` expression.
  • Add Categories on the Python side.
  • Implement partitioned sinks for the in-memory engine.
  • IR pruning.
  • Support min/max reducer for null dtype in streaming engine.
  • Implement streaming Categorical/Enum min/max.
  • Allow cast to Categorical inside list.eval.
  • Support `pathlib.Path` as source for `read/scan_delta()`.
  • Enable default set of `ScanCastOptions` for native `scan_iceberg()`.
  • Pass payload in `ExprRegistry`.
  • Support reading nanosecond/Int96 timestamps and schema evolved datasets in `scan_delta()`.
  • Support row group skipping with filters when `cast_options` is given.
  • Execute bitwise reductions in streaming engine.
  • Use `scan_parquet().collect_schema()` for `read_parquet_schema`.
  • Add `dtype` to `str.to_integer()`.
  • Add `arr.slice`, `arr.head` and `arr.tail` methods to `arr` namespace.
  • Add `is_close` method.
  • Drop superfluous casts from optimized plan.
  • Add `drop_nulls` option to `to_dummies`.
  • Support comma as decimal separator for CSV write.
  • Don't format keys if they're empty in dot.
  • Improve arity simplification.
  • Allow expression input for `length` parameter in `pad_start`, `pad_end`, and `zfill` (see the sketch after this list).
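
For the expression-valued `length` parameter mentioned above, a minimal sketch (the `code`/`width` columns are illustrative):

```python
import polars as pl

df = pl.DataFrame({"code": ["7", "42", "123"], "width": [3, 4, 5]})

# `length` may now be an expression, so the padding width can vary per row.
out = df.with_columns(
    pl.col("code").str.pad_start(pl.col("width"), "0").alias("padded")
)
print(out)
```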

🐛 Bug Fixes

  • Load `_expiry_time` from botocore `Credentials` in CredentialProviderAWS.
  • Fix credential refresh logic.
  • Fix `to_datetime()` fallible identification.
  • Correct output datatype for `dt.with_time_unit`.
  • Fix incorrect native Iceberg scan from tables with renamed/dropped columns/fields.
  • Allow DataType expressions with selectors.
  • Match output type to engine for `interpolate` on `Decimal`.
  • Remaining bugs in `with_exprs_and_input` and pruning.
  • Match output dtype to engine for `cum_sum_horizontal`.
  • Field names for `pl.struct` in group-by.
  • Fix output for `str.extract_groups` with empty string pattern.
  • Match output type to engine for `rolling_map`.
  • Moved passing `DeltaTable._storage_options`.
  • Fix incorrect join on single Int128 column for in-memory engine.
  • Match output field name to lhs for `BusinessDaycount`.
  • Correct the planner output datatype for `strptime`.
  • Sort and Scan `with_exprs_and_input`.
  • Revert to old behavior with `name.keep`.
  • Fix panic loading from arrow `Map` containing timestamps.
  • Selectors in `self` part of `list.eval`.
  • Fix output field dtype for `ToInteger`.
  • Allow `decimal_comma` with `,` separator in `read_csv`.
  • Fix handling of UTF-8 in `write_csv` to `IO[str]`.
  • Selectors in `{Lazy,Data}Frame.filter`.
  • Stop splitfields iterator at eol in simd branch.
  • Correct output datatype of dt.year and dt.mil.
  • Fix `broadcast_rhs` logic in binary functions to correct `list.set_intersection` for `list[str]` columns.
  • Order-preserving equi-join didn't always flush final matches.
  • Fix ColumnNotFound error when joining on `col().cast()`.
  • Fix agg groups on `when/then` in `group_by` context.
  • Output type for sign.
  • Apply `agg_fn` on `null` values in `pivot`.
  • Remove nonsensical duration variance.
  • Don't panic when sinking nested categorical to Parquet.
  • Correctly set value count output field name.
  • Fix casting of unused columns in `to_torch`.
  • Allow inferring of hours-only timezone offset.
  • Bug in Categorical <-> str compare with nulls.
  • Honor `n=0` in all cases of `str.replace`.
  • Remove arbitrary 25 item limit from implicit Python list -> Series infer.
  • Relabel duplicate sequence IDs in distributor.
  • Round-trip Enum and Categorical metadata in plugins.
  • Fix incorrect `join_asof` with `by` followed by `head/slice`.
  • Allow writing nested Int128 data to Parquet.
  • Enum serialization assert.
  • Output type for `peak_min` / `peak_max`.
  • Make Scalar Categorical, Enum and Struct values serializable.
  • Preserve row order within partition when sinking parquet.
  • Prevent in-mem partition sink deadlock.
  • Update AWS cloud documentation.
  • Correctly handle null values when comparing structs.
  • Make `fold`/`reduce`/`cum_reduce`/`cum_fold` serializable.
  • Make `Expr.append` serializable.
  • Fix dtype of float-by-float division.
  • Fix division on an empty DataFrame generating a null row.
  • Partition sink `copy_exprs` and `with_exprs_and_input`.
  • Unreachable with `pl.self_dtype`.
  • Fix incorrect `min_samples` handling in rolling median with nulls.
  • Make `Int128` roundtrippable via Parquet.
  • Fix panic when common subplans contain IEJoins.
  • Properly handle non-finite floats in `rolling_sum`/`rolling_mean`.
  • Make `read_csv_batched` respect `skip_rows` and `skip_lines`.
  • Always use `cloudpickle` for the python objects in cloud plans.
  • Support string literals in `index_of()` on categoricals.
  • Don't panic for `finish_callback` with nested datatypes.
  • Pass `DeltaTable._storage_options` if no `storage_options` are provided.
  • Support min/max aggregation for DataFrame/LazyFrame Categoricals (see the sketch after this list).
  • Fix var/moment dtypes.
  • Fix `agg_groups` dtype.
  • Fix incorrect `_get_path_scheme`.
  • Fix missing overload defaults in `read_ods` and `tree_format`.
  • Clear `cached_schema` when `apply` changes dtype.
  • Allow structured conversion to/from numpy with Array types, preserving shape.
  • Null handling in full-null `group_by_dynamic` mean/sum.
  • Fix index calculation for `nearest` interpolation.
  • Overload for `eager` default in `Schema.to_frame` was `False` instead of `True`.
  • Fix `read_excel` overloads so that passing `list[str]` to `sheet_name` does not raise.
  • Remove special handling for bytes-like objects in `read_ndjson`.
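
To illustrate the Categorical min/max support mentioned in this list, a small sketch (the data is illustrative; ordering semantics follow the column's categories):

```python
import polars as pl

df = pl.DataFrame(
    {"cat": pl.Series(["banana", "apple", "cherry"], dtype=pl.Categorical)}
)

# min/max aggregations on Categorical columns now work on DataFrame/LazyFrame.
print(df.select(pl.col("cat").min().alias("min"), pl.col("cat").max().alias("max")))
```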

🔧 Affected Symbols

`Selector`, `Categorical`, `Enum`, `Expr.slice`, `any()`, `all()`, `int_range(len())`, `with_columns`, `Column`, `TransformColumn`, `DataTypeExpr`, `pl.row_index()`, `read/scan_delta()`, `scan_iceberg()`, `ExprRegistry`, `str.to_integer()`, `arr.slice`, `arr.head`, `arr.tail`, `is_close`, `to_dummies`, `pad_start`, `pad_end`, `zfill`, `CredentialProviderAWS`, `to_datetime()`, `dt.with_time_unit`, `interpolate`, `cum_sum_horizontal`, `pl.struct`, `str.extract_groups`, `rolling_map`, `DeltaTable._storage_options`, `BusinessDaycount`, `strptime`, `list.eval`, `str.replace`, `join_asof`, `get_index_type()`, `Schema.to_frame`, `read_excel`