v1.13.0

📦 accelerate
✨ 5 features · 🐛 34 fixes · ⚡ 1 deprecation · 🔧 19 symbols

Summary

This release introduces official support for AWS Neuron devices and brings significant performance and stability improvements across FSDP2, DeepSpeed Sequence Parallelism, and XPU handling. The library also features faster imports by deferring heavy dependency loading.

Migration Steps

  1. If using DeepSpeed Sequence Parallelism, note that DeepSpeed now manages its own process group.
  2. If encountering issues with FSDP2 and torch < 2.7.0, ensure your setup is compatible or update torch.
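Step 2's version gate can be sketched as a small helper (hypothetical, not part of Accelerate) that parses a torch version string before enabling features that require torch >= 2.7.0:

```python
def meets_min_torch(version_str: str, minimum: tuple = (2, 7, 0)) -> bool:
    """Return True if a torch version string (e.g. '2.6.0+cu121') is >= minimum."""
    core = version_str.split("+")[0]                    # drop the local build suffix
    parts = tuple(int(p) for p in core.split(".")[:3])  # (major, minor, patch)
    return parts >= minimum

# In real code you would pass torch.__version__ here.
print(meets_min_torch("2.6.0+cu121"))  # → False: guard FSDP2 ignored_params usage
print(meets_min_torch("2.7.1"))        # → True
```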

✨ New Features

  • Added support for AWS Neuron (Trainium/Inferentia) devices.
  • Improved XPU device-agnostic code by removing IPEX dependency and using spawn instead of fork.
  • Enhanced FP8 training support, including fixes for torchao default config with padding and FSDP2 all-gather.
  • Accelerate now imports faster by deferring heavy dependencies.
  • torch.compile hooks are now disabled lazily.
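The faster-import feature works by deferring heavy dependency loading. A minimal sketch of that lazy-import pattern (simplified illustration, not Accelerate's actual implementation):

```python
import importlib

class LazyModule:
    """Proxy that defers the real import until first attribute access."""
    def __init__(self, name: str):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # __getattr__ only fires for attributes not found on the instance,
        # so _name and _module set in __init__ do not recurse here.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)

json = LazyModule("json")          # no import cost paid yet
print(json.dumps({"lazy": True}))  # → {"lazy": true}  (the real module loads here)
```

The proxy keeps module-level `import` statements cheap: the cost of loading a heavy dependency is paid only by code paths that actually use it.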

🐛 Bug Fixes

  • Fixed KMP_AFFINITY being set incorrectly for non-CPU training.
  • FSDP2: Upcast parameters only if requires_grad.
  • Fixed FSDP2 tied embedding errors with targeted ValueError guidance.
  • Fixed issue where FSDP could not load optimizer state using DCP.
  • Fixed crash in optimizer.step when FSDP2 is enabled and model is bfloat16.
  • Fixed FSDP2 crash with ignored_params on torch < 2.7.0.
  • Fixed DeepSpeed Sequence Parallelism (SP) loss computation example.
  • Fixed DeepSpeed SP integration to error out when both Context Parallelism (CP) and SP are enabled.
  • Fixed DeepSpeed SP integration to skip device mesh creation when DeepSpeed is used with sp_size > 1.
  • Enabled evaluation during DeepSpeed Sequence Parallelism training.
  • Fixed FP8 torchao default config with padding and FSDP2 all-gather support.
  • Fixed FP8 execution with Transformer Engine.
  • Allowed non-Tensor values in a batch when dispatch_batches=True.
  • Fixed module and optimizer parameter mismatch before prepare_tp_.
  • Fixed KeyError in extract_model_from_parallel for partial torch.compile.
  • Fixed hf_device_map device index comparison in prepare_model.
  • Fixed StatefulDataLoader KeyError when num_workers > 0.
  • Fixed stateful dataloader DDP issues.
  • Removed duplicate W&B initialization in offline mode.
  • Avoided using nvidia-smi on a CPU-only Colab instance.
  • Fixed logging logic when in_order is set to True.
  • Fixed CPU offload check.
  • Fixed bug when both cpu_ram_efficient_loading and cpu_offload are enabled.
  • Fixed async compatibility across Python versions.
  • Fixed a bug when using tensor parallelism (TP) only.
  • Fixed parallelism_config None error.
  • Fixed a NumPy-related parallelism issue.
  • Changed the default value of fsdp_min_num_params to an int.
  • Fixed mutable default in Megatron init and IndexError on empty ModuleList.
  • Fixed an issue in TP preparation.
  • Removed 8bit force hook for bnb.
  • Fixed RNG state setting for HPU.
  • Fixed loading the HPU RNG state.
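Several of the fixes above are conditional-environment guards. For example, the KMP_AFFINITY fix amounts to only touching the variable for CPU training; a rough sketch (hypothetical helper and affinity value, not Accelerate's actual code):

```python
import os

def maybe_set_kmp_affinity(device_type: str) -> None:
    """Only set KMP_AFFINITY for CPU training; leave other devices untouched."""
    if device_type == "cpu":
        # Assumed example value; real deployments tune this for their CPU topology.
        os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")

os.environ.pop("KMP_AFFINITY", None)   # start clean for the demo
maybe_set_kmp_affinity("cuda")
print("KMP_AFFINITY" in os.environ)    # → False (non-CPU training left alone)
maybe_set_kmp_affinity("cpu")
print("KMP_AFFINITY" in os.environ)    # → True
```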

⚡ Deprecations

  • MS-AMP is now flagged as deprecated in low-precision training guides.