v1.11.0
📦 accelerate
⚠ 1 breaking · ✨ 5 features · 🐛 15 fixes · ⚡ 1 deprecation · 🔧 7 symbols
Summary
This release introduces support for TransformerEngine MXFP8 and enables FP16/BF16 training on MPS devices. It also drops support for Python 3.9 and brings numerous stability and feature updates across FSDP and nd-parallelism.
⚠️ Breaking Changes
- Support for Python 3.9 has been dropped, as it has reached end-of-life. Users must upgrade to Python 3.10 or newer.
Migration Steps
- If you are using Python 3.9, you must upgrade your environment to Python 3.10 or newer.
- To use TE MXFP8 support, configure `use_mxfp8_block_scaling` in your `fp8_config` (see the sketch after this list).
- When training on MPS devices with mixed precision, ensure you have torch >= 2.8 for FP16 or torch >= 2.6 for BF16.
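A minimal sketch of enabling this from Python, assuming the new flag is exposed on the `TERecipeKwargs` handler (the in-code counterpart of the `fp8_config` file entry; treat the exact field name as illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import TERecipeKwargs

# Assumption: `use_mxfp8_block_scaling` mirrors the `fp8_config` entry of the
# same name on the TransformerEngine recipe handler.
fp8_kwargs = TERecipeKwargs(use_mxfp8_block_scaling=True)
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[fp8_kwargs])
```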
✨ New Features
- Added support for the TransformerEngine (TE) MXFP8 recipe, configurable by setting `use_mxfp8_block_scaling` in `fp8_config` (see the sketch under Migration Steps).
- Enabled FP16 and BF16 mixed precision training on MPS (Apple Silicon) devices. FP16 requires torch >= 2.8, and BF16 requires torch >= 2.6 (example after this list).
- FSDPv2 now supports the `ignored_params` configuration.
- FSDPv2 now supports disabling gradient synchronization via `model.set_requires_gradient_sync(False)` (sketch after this list).
- The mixed precision policy can now be passed as a dtype string via the accelerate CLI flag or the `fsdp_config` section of the accelerate config file.
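A short example of the MPS path; apart from the torch version floors noted above, it is the standard Accelerator flow:

```python
import torch
from accelerate import Accelerator

# FP16 on MPS needs torch >= 2.8; BF16 needs torch >= 2.6.
if torch.backends.mps.is_available():
    accelerator = Accelerator(mixed_precision="fp16")  # or "bf16"
    model = accelerator.prepare(torch.nn.Linear(8, 8))
    with accelerator.autocast():
        out = model(torch.randn(4, 8, device=accelerator.device))
```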
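And a hedged sketch of the gradient-sync toggle for gradient accumulation, assuming the script is launched via `accelerate launch` with a config selecting FSDP version 2 (only FSDPv2-wrapped modules expose `set_requires_gradient_sync`):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # FSDPv2 is selected in the accelerate config file
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

# Skip the gradient all-reduce on accumulation micro-batches...
model.set_requires_gradient_sync(False)
model(torch.randn(4, 8, device=accelerator.device)).sum().backward()

# ...then re-enable sync for the micro-batch before the optimizer step.
model.set_requires_gradient_sync(True)
model(torch.randn(4, 8, device=accelerator.device)).sum().backward()
optimizer.step()
optimizer.zero_grad()
```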
🐛 Bug Fixes
- Fixed CPU RAM efficient loading issues for nd-parallelism and HSDP.
- Fixed an XPU INT64 all_gather issue present in torch 2.9.
- Specified `device_ids` in `torch.distributed.barrier` when using `PartialState`.
- Fixed device specification for `process_tensor` in the example usage.
- Improved the time complexity of `get_balanced_memory` by using a set for membership checks.
- Now skips the CUDA cache flush when the origin device is `cpu` and weights are offloaded to `meta`.
- Fixed conversion of `LayerNorm` without bias to FP8.
- Switched the XPU CCL backend to torch's built-in XCCL in `test_zero3_integration`.
- Fixed FSDP2 test case failure on XPU.
- Fixed tests related to SwanLab tracking.
- Fixed `SWANLAB_MODE` tracking.
- Fixed a multi-node CUDA error: invalid device ordinal.
- Ensured `reset_peak_memory_stats` is used on XPU.
- Fixed `torch_npu` import error in some environments.
- Fixed a typo that caused tests to fail.
🔧 Affected Symbols
- `Accelerator.autocast()`
- `FindTiedParametersResult`
- `fp8_config`
- `model.set_requires_gradient_sync`
- `get_balanced_memory`
- `get_parameters_from_modules`
- `torch.distributed.barrier`
⚡ Deprecations
- The deprecated `FindTiedParametersResult` has been removed.