
v1.8.0

Breaking Changes
📦 accelerate
1 breaking change · 11 features · 11 fixes · 1 deprecation · 12 affected symbols

Summary

This release introduces a major refactoring of the FSDPv2 setup, adds FP8 support, and significantly improves performance and stability for Intel CPU/XPU users. It also deprecates `ipex.optimize` and integrates SwanLab as a new experiment tracker.

⚠️ Breaking Changes

  • The FSDPv2 setup is now more restrictive: its composition with features such as FP8, `torch.compile`, and activation checkpointing has been simplified. Existing FSDPv2 setups must be adapted to the new, simplified composition method to avoid errors and preserve performance (see the sketch below).
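
A minimal sketch of an FSDPv2 setup under the simplified composition, assuming `FullyShardedDataParallelPlugin` with `fsdp_version=2` is available in your accelerate version; the specific keyword arguments and model are illustrative, not a definitive recipe.

```python
# Illustrative FSDPv2 setup; run under `accelerate launch` with multiple processes.
# Exact plugin arguments may differ between accelerate versions.
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin

# fsdp_version=2 selects the FSDPv2 (fully_shard) code path (assumed here).
fsdp_plugin = FullyShardedDataParallelPlugin(fsdp_version=2)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin, mixed_precision="bf16")

model = torch.nn.Linear(128, 128)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Let accelerate compose FSDPv2 with the other features (activation
# checkpointing, torch.compile, FP8) instead of wrapping the model manually.
model, optimizer = accelerator.prepare(model, optimizer)
```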

Migration Steps

  1. Review and update your FSDPv2 setup to align with the new, simplified composition rules.
  2. If you rely on `ipex.optimize`, transition to native PyTorch optimizations when using PyTorch 2.8 or newer.
  3. If you train on Intel CPUs, verify or update the `CCL_WORKER_COUNT` and `KMP` environment variables for performance gains (see the sketch after this list).
  4. If you encounter issues with tracker initialization in distributed setups, note that initialization is now deferred.
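
A minimal sketch of the kind of environment configuration step 3 refers to, assuming an MPI/oneCCL-backed multi-CPU launch; the values below are illustrative placeholders, not tuned recommendations.

```python
# Illustrative environment setup for distributed training on Intel CPUs.
# Set these before the distributed backend is initialized (or export them
# in the shell before `accelerate launch`); values are placeholders.
import os

os.environ.setdefault("CCL_WORKER_COUNT", "1")  # oneCCL worker threads per rank
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")  # OpenMP thread pinning
os.environ.setdefault("KMP_BLOCKTIME", "1")  # OpenMP thread wait time after work completes (ms)
```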

✨ New Features

  • Added support for FP8 with FSDPv2.
  • Improved distributed training performance on Intel CPUs by updating `CCL_WORKER_COUNT` and adding `KMP` parameters.
  • Added support for regional compilation with the DeepSpeed engine.
  • Greatly expanded and stabilized support for Intel XPUs across various features (FSDP2 benchmark, inference, testing, etc.).
  • Added support for SwanLab as an experiment tracking backend (see the sketch after this list).
  • Deferred all experiment tracker initializations to prevent premature setup of distributed environments.
  • Added CPU offload capability.
  • Added support for standalone mode when the default port is occupied on a single node.
  • Added the ability to pass kwargs through to the optimizer, scheduler, and dataloader via the `accelerator().load_state()` function.
  • Added `fp8_e5m2` support to `dtype_byte_size`.
  • Added automatic synchronization of gradient accumulation settings from the DeepSpeed plugin.
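
A minimal sketch of logging with the new SwanLab backend, assuming it is selected with `log_with="swanlab"` like the other built-in trackers; the project name, config, and metrics are illustrative.

```python
# Illustrative SwanLab logging through accelerate's tracker API; the
# "swanlab" backend string is assumed to follow the pattern of the
# other built-in trackers (e.g. "wandb", "tensorboard").
from accelerate import Accelerator

accelerator = Accelerator(log_with="swanlab")
accelerator.init_trackers("my-project", config={"lr": 1e-4})  # hypothetical project/config

for step in range(10):
    loss = 0.1 * step  # placeholder metric
    accelerator.log({"loss": loss}, step=step)

accelerator.end_training()
```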

🐛 Bug Fixes

  • Fixed bf16 training with Tensor Parallelism (TP).
  • Improved handling of FP8 with and without DeepSpeed.
  • Fixed issues related to Gaudi Runners.
  • Removed reliance on `torch_ccl`.
  • Resolved logger warnings.
  • Fixed an error where `'list' object has no attribute 'keys'` was raised.
  • Removed `device_count` from the TPU launcher to avoid initializing the runtime.
  • Fixed missing `te.LayerNorm` in `intel_transformer_engine`.
  • Removed a hardcoded CUDA dependency from the FSDPv2 setup.
  • Used the correct labels in the FSDP2 examples.
  • Fixed gradient accumulation issues with DeepSpeed.

🔧 Affected Symbols

`FSDPv2` · `ipex.optimize` · `torch.nn.Module.compile` · `torch.compile` · `CCL_WORKER_COUNT` · `KMP` parameters · DeepSpeed engine · SwanLab · `accelerator().load_state()` · `dtype_byte_size` · `te.LayerNorm` · `TorchTensorParallelPlugin`

⚡ Deprecations

  • `ipex.optimize` is deprecated because most of its optimizations have been upstreamed into PyTorch. Users on PyTorch 2.8 or newer should rely on native PyTorch optimizations; IPEX is still used for users on older PyTorch versions. A sketch of this fallback logic follows below.
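
A minimal sketch of the version-gated fallback described above, as an illustration of the policy rather than accelerate's actual implementation; `maybe_ipex_optimize` is a hypothetical helper name.

```python
# Hypothetical helper illustrating the deprecation policy: prefer native
# PyTorch (>= 2.8) optimizations, fall back to ipex.optimize otherwise.
import torch
from packaging import version

def maybe_ipex_optimize(model, optimizer):
    if version.parse(torch.__version__) >= version.parse("2.8.0"):
        # Native PyTorch already carries the relevant CPU optimizations.
        return model, optimizer
    try:
        import intel_extension_for_pytorch as ipex  # deprecated path
    except ImportError:
        return model, optimizer
    # ipex.optimize returns (model, optimizer) when an optimizer is passed.
    return ipex.optimize(model, optimizer=optimizer)
```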