Accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

Latest: v1.12.0 · 13 releases · 6 breaking changes · 11 common errors · View on GitHub

Release History

v1.12.0 (2 fixes, 2 features)
Nov 21, 2025

This release introduces major integration with DeepSpeed Ulysses/ALST for sequence parallelism, enabling efficient long-sequence training. It also includes several minor fixes and documentation updates.

v1.11.0 (Breaking; 15 fixes, 5 features)
Oct 20, 2025

This release introduces support for TransformerEngine MXFP8 and enables FP16/BF16 training on MPS devices. It also drops support for Python 3.9 and brings numerous stability and feature updates across FSDP and nd-parallelism.

v1.10.1 (1 fix, 1 feature)
Aug 25, 2025

Version 1.10.1 introduces a new `to_json` utility and improves import safety for device mesh functionality.

v1.10.0 (Breaking; 6 fixes, 4 features)
Aug 7, 2025

This release introduces comprehensive N-D Parallelism support via `ParallelismConfig` integrated with `Accelerator`, alongside significant FSDP improvements, particularly for MoE models.

v1.9.0 (Breaking; 6 fixes, 5 features)
Jul 16, 2025

This release introduces native support for the trackio experiment tracking library and includes significant speedups for model loading, alongside various minor improvements and fixes for FSDP and DeepSpeed configurations.

v1.8.1 (2 features)
Jun 20, 2025

This minor release introduces support for the e5m2 dtype and sets the default strategy to hybrid when using a launcher.

v1.8.0 (Breaking; 11 fixes, 11 features)
Jun 19, 2025

This release introduces a major refactor of the FSDPv2 setup, adds FP8 support, and significantly improves performance and stability for Intel CPU/XPU users. It also deprecates `ipex.optimize` and integrates SwanLab as a new experiment tracker.

v1.7.0 (Breaking; 8 fixes, 5 features)
May 15, 2025

This release introduces significant performance improvements through regional compilation for torch.compile and adds layerwise casting hooks for memory optimization. It also brings substantial enhancements to FSDP2 support, including enabling `FULL_STATE_DICT` and fixing memory issues.

v1.6.0 (10 fixes, 5 features)
Apr 1, 2025

This release introduces major features including FSDPv2 support and initial DeepSpeed Tensor Parallelism support, alongside adding the XCCL distributed backend for XPU devices.

v1.5.2 (2 fixes)
Mar 14, 2025

This patch release resolves bugs in device detection and production imports.

v1.5.0 (4 fixes, 2 features)
Mar 12, 2025

This release introduces HPU accelerator support and fixes several bugs related to device indexing, CLI argument precedence, and generator initialization.

v1.4.0 (4 fixes, 2 features)
Feb 17, 2025

This release introduces initial support for FP8 training via the `torchao` backend and adds initial Tensor Parallelism support for dataloaders, alongside several bug fixes including a critical memory leak resolution.

v1.3.0 (Breaking; 10 fixes, 2 features)
Jan 17, 2025

This release enforces PyTorch 2.0 as the minimum required version and introduces improvements for handling compiled models, TPU execution, and various bug fixes across device support and offloading.

Common Errors

ChildFailedError (17 reports)

The "ChildFailedError" in Accelerate usually arises from unhandled exceptions or errors within the child processes during distributed training, such as a missing key in the model state dict or NCCL timeouts. To resolve this, carefully examine the complete traceback to identify the root cause (e.g., missing model weights or network issues), and address the specific error in your code or environment configuration. Increasing NCCL timeout values or ensuring all processes have access to necessary model components are common fixes.

DistBackendError (7 reports)

DistBackendError in Accelerate often indicates communication problems between processes, such as NCCL timeouts or CUDA out-of-memory errors during distributed training or inference. Try reducing the batch size (optionally compensating with gradient accumulation) or enabling gradient checkpointing to relieve memory pressure; for NCCL timeouts, allow more time for communication by setting environment variables such as `NCCL_BLOCKING_WAIT=1 NCCL_DEBUG=INFO NCCL_TIMEOUT=<timeout_in_seconds>`. You can also reduce communication volume with a DDP communication hook such as `torch.distributed.algorithms.ddp_comm_hooks.default_hooks.fp16_compress_hook`.
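A sketch of setting these variables before launching the job (values are illustrative; on recent PyTorch versions `NCCL_BLOCKING_WAIT` is being superseded by `TORCH_NCCL_BLOCKING_WAIT`):

```shell
# Illustrative values; tune the timeout to your workload.
export NCCL_DEBUG=INFO       # verbose NCCL logging to locate the failing rank
export NCCL_BLOCKING_WAIT=1  # surface an error on timeout instead of hanging
export NCCL_TIMEOUT=3600     # seconds before a collective is abandoned

# accelerate launch train.py  # hypothetical training script
```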

ProcessRaisedException (3 reports)

ProcessRaisedException in Accelerate often arises from CUDA initialization failures in the subprocesses spawned by `notebook_launcher` and similar utilities, particularly in environments like Kaggle notebooks. Fix it by explicitly selecting the CUDA device at the start of the launched function, either with `torch.cuda.set_device(device_index)` or by setting `os.environ["CUDA_VISIBLE_DEVICES"] = "<device_index>"` before any CUDA call, so each process uses an available, correctly initialized device; if CUDA is not needed, fall back to `torch.device("cpu")`. Isolating the CUDA context per process this way prevents initialization conflicts.
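A minimal sketch of the device-pinning idea (the helper name `pin_device` is hypothetical; call it first inside the function you hand to `notebook_launcher`, before anything creates a CUDA context):

```python
import os

def pin_device(index: int = 0) -> str:
    """Restrict CUDA visibility for this process.

    Must run before the first CUDA call: once a CUDA context exists,
    changes to CUDA_VISIBLE_DEVICES are ignored.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]

def training_function():
    pin_device(0)  # one GPU per process; adjust the index per rank
    # ... build the model and run the training loop here ...
```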

OutOfMemoryError (3 reports)

OutOfMemoryError in accelerate usually occurs when the model or data is too large to fit in the available GPU memory, especially with large models or batch sizes. To fix it, try reducing batch size, enabling gradient accumulation, using CPU offloading, or leveraging techniques like model parallelism to distribute the model across multiple GPUs if available and properly configured with `accelerate config`.
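When shrinking the per-device batch to fit memory, gradient accumulation can keep the effective batch size seen at each optimizer step unchanged; the arithmetic (helper name hypothetical):

```python
def effective_batch_size(micro_batch: int, accumulation_steps: int,
                         num_processes: int = 1) -> int:
    """Batch size per optimizer step under gradient accumulation
    and data-parallel training."""
    return micro_batch * accumulation_steps * num_processes

# Halving the per-device batch from 32 to 16 while doubling the
# accumulation steps leaves the effective batch size unchanged:
assert effective_batch_size(32, 2, 4) == effective_batch_size(16, 4, 4) == 256
```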

FileNotFoundError (3 reports)

FileNotFoundError in accelerate often stems from missing files or incorrect paths specified in your code or within the accelerate configuration. Double-check that all necessary files (models, data, checkpoints, etc.) exist at the specified locations and that the paths are accurate relative to where your script is executed; explicitly use absolute paths to avoid ambiguity, especially in distributed environments. If using a configuration file, verify its integrity and contents using `accelerate config` to ensure it points to the right resources.
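A small sketch of the absolute-path advice (the helper and the checkpoint path are hypothetical):

```python
from pathlib import Path

def absolutize(path_str: str) -> Path:
    """Expand ~ and resolve to an absolute path, so every process in a
    distributed run refers to the same location regardless of its
    working directory."""
    return Path(path_str).expanduser().resolve()

checkpoint = absolutize("checkpoints/model.safetensors")  # hypothetical path
```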

NotImplementedError (2 reports)

This error often arises when trying to directly move a `meta` device tensor (a placeholder without actual data) to another device like CUDA. Instead of `.to("cuda")`, use `torch.nn.Module.to_empty(device="cuda")` to properly initialize the module's weights on the target device, or handle data loading before moving tensors. Ensure all parts of your model and data are correctly initialized on the expected device before any computations.
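A minimal sketch on CPU (substitute `"cuda"` on a GPU machine); note that `to_empty` allocates uninitialized storage, so weights must be loaded or reinitialized afterward:

```python
import torch
import torch.nn as nn

# Build the module on the meta device: shapes only, no real data.
with torch.device("meta"):
    model = nn.Linear(8, 4)

# model.to("cpu") here would raise NotImplementedError ("no data!").
model = model.to_empty(device="cpu")  # allocate real, uninitialized storage

# Load or reinitialize the weights before using the model.
model.load_state_dict({"weight": torch.zeros(4, 8), "bias": torch.zeros(4)})
```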
