Changelog

v1.11.0

📦 accelerate
⚠️ 1 breaking · ✨ 5 features · 🐛 15 fixes · ⚡ 1 deprecation · 🔧 7 symbols

Summary

This release introduces support for TransformerEngine MXFP8 and enables FP16/BF16 training on MPS devices. It also drops support for Python 3.9 and brings numerous stability and feature updates across FSDP and nd-parallelism.

⚠️ Breaking Changes

  • Support for Python 3.9 has been dropped as it reached End-of-Life. Users must upgrade to Python 3.10 or newer.

Migration Steps

  1. If you are using Python 3.9, you must upgrade your environment to Python 3.10 or newer.
  2. To use TE MXFP8 support, set `use_mxfp8_block_scaling` in your `fp8_config` (see the sketch after these steps).
  3. When training on MPS devices, ensure you have torch >= 2.8 for fp16 or torch >= 2.6 for bf16 if using mixed precision.
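
A minimal sketch of step 2, assuming the new flag is exposed on the `TERecipeKwargs` handler (the programmatic counterpart of the `fp8_config` entry in the accelerate config file):

```python
# Sketch: enable the TE MXFP8 recipe from Python. Assumes
# `use_mxfp8_block_scaling` is a TERecipeKwargs field mirroring the
# `fp8_config` key named in this release.
from accelerate import Accelerator
from accelerate.utils import TERecipeKwargs

accelerator = Accelerator(
    mixed_precision="fp8",
    kwargs_handlers=[TERecipeKwargs(use_mxfp8_block_scaling=True)],
)
```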

✨ New Features

  • Added support for the TransformerEngine (TE) MXFP8 recipe, configurable by setting `use_mxfp8_block_scaling` in `fp8_config`.
  • Enabled FP16 and BF16 mixed precision training support for MPS (Apple Silicon) devices. FP16 requires torch >= 2.8, and BF16 requires torch >= 2.6 (see the first sketch after this list).
  • FSDPv2 now supports `ignored_params` configuration.
  • FSDPv2 now supports disabling gradient synchronization via `model.set_requires_gradient_sync(False)` (see the second sketch after this list).
  • Mixed precision policy can now be passed as a dtype string via the accelerate CLI flag or `fsdp_config` in the accelerate config file.
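
A minimal sketch of mixed precision on MPS, assuming an Apple Silicon machine with torch >= 2.6 (bf16 shown; swap in `mixed_precision="fp16"` on torch >= 2.8). The model and shapes are illustrative only:

```python
# Sketch: a BF16 mixed precision training step on MPS.
import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")  # picks up the MPS device
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(8, 16, device=accelerator.device)
with accelerator.autocast():  # autocast now covers fp16/bf16 on MPS
    loss = model(x).pow(2).mean()
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
```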
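
And a sketch of the new gradient-sync toggle in a gradient-accumulation loop. The scaffolding (`dataloader`, `accum_steps`, `accelerator`, `optimizer`) is illustrative, and `model` must have been prepared under an FSDP2 config for `set_requires_gradient_sync` to exist:

```python
# Sketch: skip the FSDP2 gradient all-reduce on non-boundary accumulation
# steps, then re-enable sync at the boundary before the optimizer step.
accum_steps = 4  # illustrative
for step, batch in enumerate(dataloader):
    at_boundary = (step + 1) % accum_steps == 0
    model.set_requires_gradient_sync(at_boundary)
    loss = model(batch).pow(2).mean() / accum_steps
    accelerator.backward(loss)
    if at_boundary:
        optimizer.step()
        optimizer.zero_grad()
```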

🐛 Bug Fixes

  • Fixed CPU-RAM-efficient loading issues for nd-parallelism and HSDP.
  • Fixed an XPU INT64 `all_gather` issue present in torch 2.9.
  • Specified `device_ids` in `torch.distributed.barrier` for `PartialState` usage (see the sketch after this list).
  • Fixed the device specification for `process_tensor` in the example usage.
  • Improved the time complexity of `get_balanced_memory` by using a set.
  • Now skips the CUDA cache flush when the origin device is `cpu` and the offload target is `meta`.
  • Fixed conversion of LayerNorm without bias to fp8.
  • Switched the XPU CCL backend to torch's built-in `xccl` in `test_zero3_integration`.
  • Fixed FSDP2 test case failure on XPU.
  • Fixed tests related to the SwanLab tracker.
  • Fixed `SWANLAB_MODE` tracking.
  • Fixed a multi-node CUDA error: invalid device ordinal.
  • Ensured `reset_peak_memory_stats` is used on XPU.
  • Fixed `torch_npu` import error in some environments.
  • Fixed a typo that caused tests to fail.
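
A minimal repro of the barrier pattern above, assuming a `torchrun` launch (it will not run standalone). Pinning `device_ids` to the local rank keeps NCCL off device 0, the usual source of "invalid device ordinal" errors on multi-node runs:

```python
# Sketch: pin the NCCL barrier to this rank's GPU.
# Launch with e.g. `torchrun --nproc-per-node=2 repro.py`.
import os

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.barrier(device_ids=[local_rank])  # explicit ordinal: the fixed pattern
dist.destroy_process_group()
```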

🔧 Affected Symbols

`Accelerator.autocast()` · `FindTiedParametersResult` · `fp8_config` · `model.set_requires_gradient_sync` · `get_balanced_memory` · `get_parameters_from_modules` · `torch.distributed.barrier`

⚡ Deprecations

  • The deprecated `FindTiedParametersResult` has been removed.