Change8

v1.7.0

Breaking Changes
📦 accelerate
⚠️ 1 breaking · ✨ 5 features · 🐛 8 fixes · 🔧 4 symbols

Summary

This release introduces significant performance improvements through regional compilation for `torch.compile` and adds layerwise casting hooks for memory optimization. It also brings substantial enhancements to FSDP2 support, including enabling `FULL_STATE_DICT` and fixing memory issues.

⚠️ Breaking Changes

  • The logic for setting `self.dynamic` in `TorchDynamoPlugin` now explicitly preserves `None` instead of defaulting to `False` when the `USE_DYNAMIC` environment variable is unset. This aligns behavior with the PyTorch documentation for `torch.compile`. Users who relied on the previous default should set `USE_DYNAMIC` explicitly if they require `False`.
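
A hedged sketch of pinning the old behavior explicitly, assuming `TorchDynamoPlugin` is imported from `accelerate.utils` and passed to `Accelerator` via `dynamo_plugin` (the `backend` value below is purely illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import TorchDynamoPlugin

# Explicitly request non-dynamic shapes instead of relying on the old implicit
# default; if `dynamic` is left unset, it is now forwarded to torch.compile as
# None, which is PyTorch's documented default.
dynamo_plugin = TorchDynamoPlugin(backend="inductor", dynamic=False)
accelerator = Accelerator(dynamo_plugin=dynamo_plugin)
```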

Migration Steps

  1. To enable regional compilation, set `use_regional_compilation=True` in your `TorchDynamoPlugin` configuration and pass the plugin when initializing `Accelerator`.
  2. To use layerwise casting hooks, call `attach_layerwise_casting_hooks(model, storage_dtype=..., compute_dtype=...)` before preparing the model (see the sketch after this list).
  3. If you rely on the default behavior of `torch.compile` when `USE_DYNAMIC` is unset, be aware that `self.dynamic` will now be `None` instead of `False`.
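
A minimal sketch combining steps 1 and 2. It assumes `attach_layerwise_casting_hooks` is importable from `accelerate.hooks`, uses a toy `nn.Sequential` in place of a real model, and the dtype choices are merely illustrative:

```python
import torch
from torch import nn
from accelerate import Accelerator
from accelerate.utils import TorchDynamoPlugin
from accelerate.hooks import attach_layerwise_casting_hooks  # assumed import path

# A toy model standing in for a real network with repeated blocks.
model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(4)])

# Step 1: opt in to regional compilation through the dynamo plugin.
dynamo_plugin = TorchDynamoPlugin(use_regional_compilation=True, backend="inductor")
accelerator = Accelerator(dynamo_plugin=dynamo_plugin)

# Step 2: store weights in a low-precision dtype and upcast each layer to the
# compute dtype only while it runs, trading a little compute for memory.
attach_layerwise_casting_hooks(
    model,
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

# Prepare as usual; the dynamo plugin settings are applied when the model is prepared.
model = accelerator.prepare(model)
```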

✨ New Features

  • Introduced regional compilation via `use_regional_compilation=True` in `TorchDynamoPlugin`, which significantly reduces cold-start compilation time by compiling the model's repeated blocks (e.g., transformer layers) individually and reusing the result, rather than compiling the whole model at once.
  • Added layerwise casting hooks via `attach_layerwise_casting_hooks` to enable per-layer downcasting/upcasting (e.g., for `Linear` layers) during inference, allowing separate storage and compute dtypes for memory savings.
  • Enabled support for `FULL_STATE_DICT` in FSDP2, allowing `.save_pretrained()` to work correctly with FSDP2-wrapped models (see the sketch after this list).
  • Added support for QLoRA training with FSDP2 (requires more testing).
  • Added documentation and configuration support for Intel Gaudi hardware (HPU).
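
A hedged sketch of requesting full state dicts with FSDP2, assuming `FullyShardedDataParallelPlugin` accepts `fsdp_version=2` and a string `state_dict_type`, and that the model is a Hugging Face Transformers model run under a distributed launcher; the checkpoint name and output directory are illustrative:

```python
from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin
from transformers import AutoModelForCausalLM

# Select FSDP2 and ask for full (gathered) state dicts on save.
fsdp_plugin = FullyShardedDataParallelPlugin(
    fsdp_version=2,
    state_dict_type="FULL_STATE_DICT",
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative checkpoint
model = accelerator.prepare(model)

# ... training loop ...

# With FULL_STATE_DICT enabled, saving a Transformers model works as usual.
accelerator.unwrap_model(model).save_pretrained(
    "output_dir",
    state_dict=accelerator.get_state_dict(model),
    save_function=accelerator.save,
)
```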

🐛 Bug Fixes

  • Resolved a backend issue with offloading parameters to CPU when combining CUDA, FSDP2, and CPU offload.
  • Fixed a significant memory spike that occurred when `cpu_ram_efficient_loading=True` was enabled.
  • Fixed an issue where the `unsafe_serialization` option in "merge-weights" did not work.
  • Fixed logic in `accelerator.prepare` for IPEX compatibility when preparing two or more `nn.Module` and/or `torch.optim.Optimizer` objects.
  • Fixed an issue where an unwanted CUDA initialization occurred due to torchao.
  • Fixed issues in the FSDP2 wrap policy and mixed precision handling.
  • Fixed an FSDP2 issue where an object was incorrectly identified as not being a buffer or parameter.
  • Fixed `notebook_launcher` for Colab TPU compatibility.

🔧 Affected Symbols

  • `TorchDynamoPlugin`
  • `Accelerator`
  • `attach_layerwise_casting_hooks`
  • `torch.compile`