Accelerate
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support.
Release History
v1.12.0 (2 fixes, 2 features): This release introduces a major integration with DeepSpeed Ulysses/ALST for sequence parallelism, enabling efficient long-sequence training. It also includes several minor fixes and documentation updates.
v1.11.0 (breaking, 15 fixes, 5 features): This release introduces support for TransformerEngine MXFP8 and enables FP16/BF16 training on MPS devices. It also drops support for Python 3.9 and brings numerous stability and feature updates across FSDP and nd-parallelism.
v1.10.1 (1 fix, 1 feature): Version 1.10.1 introduces a new `to_json` utility and improves import safety for device-mesh functionality.
v1.10.0 (breaking, 6 fixes, 4 features): This release introduces comprehensive N-D parallelism support via `ParallelismConfig` integrated with `Accelerator`, alongside significant FSDP improvements, particularly for MoE models.
v1.9.0 (breaking, 6 fixes, 5 features): This release introduces native support for the trackio experiment-tracking library and includes significant speedups for model loading, alongside various minor improvements and fixes for FSDP and DeepSpeed configurations.
v1.8.1 (2 features): This minor release introduces support for the e5e2 model type and sets the default strategy to hybrid when using a launcher.
v1.8.0 (breaking, 11 fixes, 11 features): This release introduces a major refactor of the FSDPv2 setup, adds FP8 support, and significantly improves performance and stability for Intel CPU/XPU users. It also deprecates `ipex.optimize` and integrates SwanLab as a new experiment tracker.
v1.7.0 (breaking, 8 fixes, 5 features): This release brings significant performance improvements through regional compilation for torch.compile and adds layerwise casting hooks for memory optimization. It also substantially enhances FSDP2 support, including enabling `FULL_STATE_DICT` and fixing memory issues.
v1.6.0 (10 fixes, 5 features): This release introduces major features including FSDPv2 support and initial DeepSpeed tensor-parallelism support, alongside the new XCCL distributed backend for XPU devices.
v1.5.2 (2 fixes): This patch release resolves specific bugs related to device detection and production imports.
v1.5.0 (4 fixes, 2 features): This release introduces HPU accelerator support and fixes several bugs related to device indexing, CLI argument precedence, and generator initialization.
v1.4.0 (4 fixes, 2 features): This release introduces initial support for FP8 training via the `torchao` backend and initial tensor-parallelism support for dataloaders, alongside several bug fixes including a critical memory-leak resolution.
v1.3.0 (breaking, 10 fixes, 2 features): This release enforces PyTorch 2.0 as the minimum required version and introduces improvements for handling compiled models and TPU execution, plus various bug fixes across device support and offloading.
Common Errors
ChildFailedError (17 reports): A "ChildFailedError" in Accelerate usually arises from an unhandled exception inside a child process during distributed training, such as a missing key in the model state dict or an NCCL timeout. To resolve it, examine the complete traceback to identify the root cause (e.g., missing model weights or network issues) and address that specific error in your code or environment configuration. Increasing NCCL timeout values and ensuring all processes have access to the necessary model components are common fixes.
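One failure mode named above, a missing key in the model state dict, can be diagnosed outside the launcher before it ever reaches a child process. A minimal PyTorch sketch (the model and checkpoint keys here are invented for illustration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

# A deliberately partial checkpoint: only the first layer's parameters.
partial_ckpt = {"0.weight": torch.zeros(4, 4), "0.bias": torch.zeros(4)}

# strict=True (the default) raises on missing keys, and inside a spawned
# worker that exception surfaces as a ChildFailedError with the real cause
# buried in the child's traceback. strict=False reports the mismatch instead.
result = model.load_state_dict(partial_ckpt, strict=False)
print(sorted(result.missing_keys))  # ['2.bias', '2.weight']
```

Loading the checkpoint once in a single process like this surfaces the actual missing keys directly, rather than through the wrapped child-process error.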
DistBackendError (7 reports): DistBackendError in Accelerate often indicates communication problems between processes, such as NCCL timeouts or CUDA out-of-memory issues during distributed training or inference. Try reducing the batch size, the number of gradient accumulation steps, or using gradient checkpointing to relieve memory pressure; for NCCL timeouts, increase the timeout using the `NCCL_BLOCKING_WAIT=1 NCCL_DEBUG=INFO NCCL_TIMEOUT=<timeout_in_seconds>` environment variables to allow more time for communication. You can also apply communication-reduction strategies such as `torch.distributed.algorithms.ddp_comm_hooks.default_hooks.fp16_compress_hook`.
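These knobs can also be set programmatically before the process group is initialized. A stdlib-only sketch, where the specific values are placeholders to tune for your cluster (note that recent PyTorch versions spell the blocking-wait flag `TORCH_NCCL_BLOCKING_WAIT`):

```python
import os
from datetime import timedelta

# Set these before the process group is created, e.g. at the top of the
# training script.
os.environ["NCCL_DEBUG"] = "INFO"       # verbose NCCL logs to locate the stall
os.environ["NCCL_BLOCKING_WAIT"] = "1"  # raise a Python error instead of hanging

# With Accelerate, the collective timeout itself can be passed when the
# process group is initialized, e.g.:
#   from accelerate import Accelerator
#   from accelerate.utils import InitProcessGroupKwargs
#   accelerator = Accelerator(
#       kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
#   )
timeout = timedelta(hours=2)
print(int(timeout.total_seconds()))  # 7200
```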
ProcessRaisedException (3 reports): ProcessRaisedException in Accelerate often arises from CUDA initialization failures in subprocesses launched by `notebook_launcher` or similar utilities, particularly in environments like Kaggle notebooks. Address it by explicitly setting the CUDA device visible to each subprocess at the beginning of the launched function, using `torch.cuda.set_device(device_index)` or `os.environ["CUDA_VISIBLE_DEVICES"] = "<device_index>"`, so that each process uses an available, correctly initialized CUDA device; alternatively, disable CUDA if it is not needed (`torch.device("cpu")`). This isolates the CUDA context per process and prevents initialization conflicts.
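A stdlib-only sketch of that device-pinning step; the helper name and argument are illustrative, not an Accelerate API:

```python
import os

def pin_cuda_device(device_index: int) -> None:
    """Restrict this process to a single GPU. Call at the very top of the
    function handed to notebook_launcher, before anything initializes CUDA."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    # With torch imported you would then select the (now only) visible device:
    #   import torch
    #   torch.cuda.set_device(0)  # index 0 *within* the restricted view

pin_cuda_device(1)
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 1
```

Note that after restricting `CUDA_VISIBLE_DEVICES` to one GPU, that GPU is always device 0 from the process's point of view.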
OutOfMemoryError (3 reports): OutOfMemoryError in Accelerate usually occurs when the model or data is too large to fit in the available GPU memory, especially with large models or batch sizes. To fix it, try reducing the batch size, enabling gradient accumulation, using CPU offloading, or distributing the model across multiple GPUs via model parallelism if they are available and properly configured with `accelerate config`.
FileNotFoundError (3 reports): FileNotFoundError in Accelerate often stems from missing files or incorrect paths specified in your code or in the Accelerate configuration. Double-check that all necessary files (models, data, checkpoints, etc.) exist at the specified locations and that the paths are correct relative to where your script is executed; prefer absolute paths to avoid ambiguity, especially in distributed environments. If you use a configuration file, verify its contents with `accelerate config` to ensure it points to the right resources.
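A small stdlib helper for the absolute-path advice above, failing fast in the main process instead of letting a worker trip mid-run (the helper name is illustrative, not part of Accelerate):

```python
from pathlib import Path

def resolve_required(path: str) -> Path:
    """Resolve a path to an unambiguous absolute form and verify it exists,
    reporting the working directory to make relative-path mistakes obvious."""
    p = Path(path).expanduser().resolve()
    if not p.exists():
        raise FileNotFoundError(f"Missing: {p} (cwd was {Path.cwd()})")
    return p

print(resolve_required(".").is_absolute())  # True
```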
NotImplementedError (2 reports): This error often arises when trying to move a `meta`-device tensor (a shape-only placeholder without actual data) directly to another device such as CUDA. Instead of `.to("cuda")`, use `torch.nn.Module.to_empty(device="cuda")` to allocate the module's weights on the target device and then initialize or load them, or handle data loading before moving tensors. Ensure every part of your model and data is initialized on the expected device before any computation.
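A minimal sketch of the `to_empty` pattern, using `"cpu"` as the target device so it runs without a GPU (substitute `device="cuda"` on real hardware):

```python
import torch
import torch.nn as nn

# Build the module on the meta device: shapes only, no storage allocated.
with torch.device("meta"):
    model = nn.Linear(16, 4)

# model.to("cpu") or .to("cuda") would raise NotImplementedError here,
# because meta tensors have no data to copy. Allocate empty storage on
# the target device instead...
model = model.to_empty(device="cpu")

# ...then fill in real values, e.g. by loading a checkpoint or re-running
# the usual initialization.
with torch.no_grad():
    nn.init.xavier_uniform_(model.weight)
    model.bias.zero_()

out = model(torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 4])
```

The key point is that `to_empty` only allocates memory; the weights still contain garbage until you initialize or load them.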
Related Data & ML Packages
TensorFlow: An Open Source Machine Learning Framework for Everyone
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
scikit-learn: machine learning in Python
pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Streamlit — A faster way to build and share data apps.