PyTorch Lightning
Pretrain, finetune ANY AI model of ANY size on 1 or 10,000+ GPUs with zero code changes.
Release History
2.6.1 (breaking, 9 fixes, 3 features): This patch introduces method chaining for freezing/unfreezing modules and adds litlogger integration. It also drops support for Python 3.9 and fixes several bugs related to checkpointing, hyperparameter saving, and distributed sampling.
2.6.0 (14 fixes, 7 features): Version 2.6.0 introduces new features such as WeightAveraging callbacks and Torch-TensorRT integration, alongside numerous bug fixes across PyTorch Lightning and Fabric components.
2.5.6 (1 feature): This release adds a `name()` function to the accelerator interface and removes support for the deprecated lightning-habana package.
2.5.5 (6 fixes): This patch release for PyTorch Lightning and Lightning Fabric focuses on bug fixes, including issues with `LightningCLI`, `ModelCheckpoint` saving logic, and progress bar resetting. It also includes updates for PyTorch 2.8 compatibility.
2.5.4 (5 fixes, 1 feature): This patch release for PyTorch Lightning focuses on bug fixes across checkpointing, callbacks, and strategy integrations. Lightning Fabric also adds support for NVIDIA H200 GPUs.
2.5.3 (13 fixes, 5 features): This release brings numerous bug fixes across PyTorch Lightning and Lightning Fabric, including improvements to checkpointing, logging, profiling, and progress bar rendering. New features include more flexible `ModelCheckpoint` options and handling of `training_step` return values.
2.5.2 (8 fixes, 1 feature): This release introduces the `toggled_optimizer` context manager on LightningModule and resolves several bugs related to CLI integration, DDP synchronization, and checkpointing. Users are advised to update `fsspec` for cross-device checkpointing.
2.5.1.post0: A post-release update following version 2.5.1, with details available in the linked comparison.
2.5.1 (10 fixes, 4 features): This release enhances logging integrations such as MLflow and CometML, allows customization of LightningCLI argument parsing, and fixes several bugs related to logging latency, checkpoint resumption, and logger behavior. Legacy support for `lightning run model` has been removed in favor of `fabric run`.
Common Errors
FileNotFoundError (4 reports): FileNotFoundError in PyTorch Lightning often arises when file paths used for saving checkpoints, configurations, or logs are invalid or the destination directory does not exist. To fix this, ensure all target directories exist before writing to them, creating them with `os.makedirs(path, exist_ok=True)` if needed, and validate that paths are correctly formed, especially absolute paths or paths with special characters across operating systems. Use `os.path.join` for robust path construction.
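The directory-creation fix above can be sketched as follows; the checkpoint path used here is a hypothetical example, not one Lightning produces itself:

```python
import os

def ensure_parent_dir(path):
    """Create the parent directory of `path` if it does not already exist."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)  # no error if the directory exists
    return path

# Hypothetical checkpoint path, built with os.path.join for portability.
ckpt_path = ensure_parent_dir(os.path.join("lightning_logs", "run_01", "best.ckpt"))
```

Calling this once before handing the path to a checkpoint callback or logger avoids the error regardless of whether the run directory already exists.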
NotImplementedError (2 reports): NotImplementedError in PyTorch Lightning usually arises when a required method, such as `training_step`, `configure_optimizers`, or a hook a callback expects, is not defined in your LightningModule or Callback. Resolve this by overriding all necessary methods in your LightningModule or Callback classes with your custom logic, paying close attention to the expected inputs and outputs of each method as defined in the PyTorch Lightning documentation. When using callbacks, verify that you are overriding the intended hook signatures rather than relying on inherited no-op implementations.
ProcessExitedException (2 reports): ProcessExitedException in PyTorch Lightning tests often arises from unexpected process termination in multi-processing scenarios such as `ddp_fork`. It commonly stems from resource exhaustion, unhandled exceptions in child processes, or conflicts with system-level libraries. To resolve it, ensure adequate system resources (RAM, CPU), implement robust error handling within child processes, and check for library incompatibilities, especially with multiprocessing on macOS.
OutOfMemoryError (2 reports): OutOfMemoryError in PyTorch Lightning typically occurs when the GPU runs out of memory during training. Fix this by reducing the `batch_size` in your DataLoader, using gradient accumulation via the Trainer's `accumulate_grad_batches` argument, or enabling mixed precision (for example, `precision="16-mixed"` in the Trainer). Periodically calling `torch.cuda.empty_cache()` can release cached memory back to the driver, though it does not offload computation to the CPU. For further relief, consider GPUs with more memory or distributed training.
DistNetworkError (1 report): DistNetworkError in PyTorch Lightning distributed tests often arises from address conflicts, specifically the `EADDRINUSE` error, indicating a port is already in use. To fix it, pick an available port by setting the `MASTER_PORT` environment variable, e.g. `os.environ["MASTER_PORT"] = str(find_free_port())`, or configure the TCP store to find an open port automatically, avoiding address collisions during distributed initialization.
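A possible implementation of the `find_free_port` helper named above, using only the standard library:

```python
import os
import socket

def find_free_port():
    """Ask the OS for a currently unused TCP port.

    Note: the port is released when the socket closes, so there is a
    small race window before the distributed store binds it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]

# Point distributed initialization at the free port before launching workers.
os.environ["MASTER_PORT"] = str(find_free_port())
```

Set the variable before the Trainer spawns worker processes so every rank inherits the same port.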
WandbAttachFailedError (1 report): WandbAttachFailedError in PyTorch Lightning often arises when Weights & Biases is initialized outside the main process in a distributed training setting, especially with TPUs, which interferes with proper experiment tracking. To fix this, initialize wandb only on the main process (rank 0), for example with a conditional check like `if self.trainer.global_rank == 0: wandb.init(...)`, or use PyTorch Lightning's built-in WandbLogger, which handles this automatically.
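The rank-zero guard can be sketched generically without importing wandb; reading `RANK` from the environment is an assumption that matches common torch.distributed launchers, and `init_on_rank_zero` is a hypothetical helper name:

```python
import os

def init_on_rank_zero(init_fn):
    """Run a tracking-init callable (e.g. `lambda: wandb.init(...)`) only on
    the main process; other ranks return None without initializing."""
    rank = int(os.environ.get("RANK", "0"))  # launcher-set rank, default 0
    if rank == 0:
        return init_fn()
    return None
```

Inside a LightningModule you would instead check `self.trainer.global_rank == 0` as described above, or let WandbLogger apply this guard for you.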
Related Data & ML Packages
TensorFlow: An Open Source Machine Learning Framework for Everyone
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
scikit-learn: machine learning in Python
pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Streamlit — A faster way to build and share data apps.