= Multi-GPU Training Setup =
This guide outlines how to set up a Linux server for distributed training using PyTorch's Distributed Data Parallel (DDP) and DeepSpeed. This is essential for accelerating deep learning model training by leveraging multiple GPUs.
== Prerequisites ==
- A Linux server with multiple NVIDIA GPUs: Distributed data parallelism splits each training batch across GPUs, so multiple GPUs are required. You can rent GPU servers from providers such as Immers Cloud.
- NVIDIA Drivers and CUDA Toolkit: Ensure that the NVIDIA drivers and the appropriate CUDA Toolkit version are installed and configured correctly on your server. You can verify this with:
nvidia-smi
nvcc --version
- Open MPI (optional, used by some DeepSpeed features and multi-node launchers). On Debian/Ubuntu-based systems:
sudo apt update && sudo apt install libopenmpi-dev openmpi-bin openmpi-common
(Use `yum` or `dnf` for RHEL/CentOS/Fedora.)
== Step 1: Install PyTorch with Distributed Support ==
PyTorch's prebuilt binaries include distributed training support (the `torch.distributed` package and the NCCL backend); you just need to install a build that matches your CUDA version.
1. Update pip:
pip install --upgrade pip
2. Install PyTorch: Visit the official PyTorch website ([https://pytorch.org/get-started/locally/]) to find the correct installation command for your CUDA version. For example, if you have CUDA 11.8 installed, the command might look like this:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Note: Always refer to the official PyTorch website for the most up-to-date installation commands.
3. Verify installation:
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
This should output your PyTorch version and `True` if CUDA is available.
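As an extra check, the snippet below (a minimal sketch using only standard `torch` APIs) confirms that the distributed package and the NCCL backend used in the examples below are available, and reports how many GPUs PyTorch can see:
```python
import torch
import torch.distributed as dist

# Number of GPUs visible to PyTorch
print("GPU count:", torch.cuda.device_count())

# DDP with the "nccl" backend needs both of these to be True
print("torch.distributed available:", dist.is_available())
print("NCCL backend available:", dist.is_nccl_available())
```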
== Step 2: Set up a Simple DDP Example ==
Let's create a basic script to test PyTorch's Distributed Data Parallel (DDP).
1. Create a Python script (e.g., `ddp_test.py`):
nano ddp_test.py
2. Paste the following code into the file:
```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'  # Use a free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # Bind this process to its own GPU
    print(f"Rank {rank}/{world_size} initialized.")

    # Simple model
    model = nn.Linear(10, 10).to(rank)
    # Wrap the model with DDP
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    # Dummy data
    dummy_input = torch.randn(20, 10).to(rank)
    labels = torch.randn(20, 10).to(rank)

    # Forward and backward pass; DDP synchronizes gradients across ranks
    outputs = ddp_model(dummy_input)
    loss_fn = nn.MSELoss()
    loss = loss_fn(outputs, labels)
    loss.backward()

    print(f"Rank {rank} loss: {loss.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # Use all available GPUs
    print(f"Using {world_size} GPUs.")
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```
3. Save and exit nano (Ctrl+X, Y, Enter).
4. Run the script:
python ddp_test.py
You should see output indicating that each rank (GPU) has initialized and printed its loss.
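As an alternative to `mp.spawn`, you can launch the same logic with `torchrun`, which starts one process per GPU and sets `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, and the master address/port in the environment. A minimal sketch (the file name `ddp_torchrun.py` is just an example):
```python
# ddp_torchrun.py -- launch with:
#   torchrun --nproc_per_node=<num_gpus> ddp_torchrun.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist

def main():
    # Rank and world size are read from the environment set by torchrun
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).to(local_rank)
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # Same dummy forward/backward pass as in ddp_test.py
    outputs = ddp_model(torch.randn(20, 10).to(local_rank))
    loss = nn.MSELoss()(outputs, torch.randn(20, 10).to(local_rank))
    loss.backward()

    print(f"Rank {dist.get_rank()} loss: {loss.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```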
== Step 3: Install DeepSpeed ==
DeepSpeed is a deep learning optimization library that further enhances distributed training efficiency, especially for large models.
1. Install DeepSpeed:
pip install deepspeed
2. Verify DeepSpeed installation:
python -c "import deepspeed; print(deepspeed.__version__)"
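You can also run DeepSpeed's bundled environment report, which checks your compiler, CUDA version, and which optimized ops can be built on your system:
ds_report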
== Step 4: DeepSpeed Example with DDP ==
DeepSpeed replaces the DDP wrapper with its own engine, which handles gradient synchronization, the optimizer step, and mixed precision. Here is the previous example adapted for the `deepspeed` launcher.
1. Create a Python script (e.g., `deepspeed_test.py`):
nano deepspeed_test.py
2. Paste the following code into the file:
```python
import torch
import torch.nn as nn
import deepspeed

def main():
    # The deepspeed launcher starts one process per GPU and sets
    # RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    model = nn.Linear(10, 10)

    # DeepSpeed configuration (minimal example)
    ds_config = {
        # Per-GPU batch size; the total batch is this times the world size
        "train_micro_batch_size_per_gpu": 20,
        "optimizer": {
            "type": "Adam",
            "params": {"lr": 0.001}
        },
        "fp16": {"enabled": True}
    }

    # deepspeed.initialize sets up the NCCL process group, moves the
    # model to the local GPU, and wraps it in a DeepSpeed engine
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config
    )
    rank = model_engine.global_rank
    device = model_engine.device
    print(f"Rank {rank}/{model_engine.world_size} initialized.")

    # Dummy data; cast to half precision because fp16 is enabled
    dummy_input = torch.randn(20, 10, device=device).half()
    labels = torch.randn(20, 10, device=device).half()

    # Forward and backward pass
    outputs = model_engine(dummy_input)
    loss = nn.MSELoss()(outputs, labels)
    model_engine.backward(loss)  # handles fp16 loss scaling
    model_engine.step()          # optimizer step + zero_grad

    print(f"Rank {rank} loss: {loss.item()}")

if __name__ == "__main__":
    main()
```
3. Save and exit nano (Ctrl+X, Y, Enter).
4. Run the script using the DeepSpeed launcher:
deepspeed deepspeed_test.py
The launcher automatically starts one process per visible GPU (use `--num_gpus` to restrict it) and sets the rank and world-size environment variables that `deepspeed.initialize` reads.
== Step 5: Real-World Training with DeepSpeed ==
For actual model training, you'll need a more sophisticated DeepSpeed configuration file and a training script.
1. Create a DeepSpeed configuration file (e.g., `ds_config.json`):
nano ds_config.json
2. Paste a sample configuration. This is a basic example; refer to the DeepSpeed documentation for advanced options. Note that the `"auto"` values are only resolved by integrations that fill them in (such as the Hugging Face Transformers `Trainer`); if you call `deepspeed.initialize` directly, replace them with explicit values:
```json
{
  "fp16": { "enabled": true },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_optimization": { "stage": 1 },
  "wall_clock_breakdown": false,
  "flops_profiler": {
    "enabled": false,
    "profile_step": 10
  }
}
```
3. Modify your training script (e.g., `my_training_script.py`) to load this configuration; a data-sharding sketch for the training loop follows this list:
```python
import deepspeed
import torch

# ... your model definition, dataset loading, etc. ...

model = YourModel(...)
# Define an optimizer/scheduler yourself and pass them to
# deepspeed.initialize, or omit them and let DeepSpeed build them
# from the "optimizer" and "scheduler" sections of the config.

# Initialize DeepSpeed
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    args=your_arg_parser_object,   # if you use argparse
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"        # path to your config file
)

# ... your training loop ...
# Use model_engine for the forward pass, model_engine.backward(loss),
# and model_engine.step() instead of calling the optimizer directly.
```
4. Launch your training script:
deepspeed --num_gpus=N my_training_script.py
Replace `N` with the number of GPUs you want to use. (If you prefer to pass the config file on the command line instead of hard-coding its path, add DeepSpeed's standard arguments to your parser with `deepspeed.add_config_arguments(parser)` and launch with `--deepspeed --deepspeed_config ds_config.json`.)
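Inside the training loop, each rank should see a different shard of the data. Below is a minimal sketch of what that loop might look like with `DistributedSampler`; `train_dataset`, `micro_batch_size`, `num_epochs`, and `loss_fn` are placeholders for your own objects:
```python
from torch.utils.data import DataLoader, DistributedSampler

# Shard the dataset across ranks so each GPU sees different samples
sampler = DistributedSampler(train_dataset)
loader = DataLoader(train_dataset, batch_size=micro_batch_size, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for batch, labels in loader:
        batch = batch.to(model_engine.device)
        labels = labels.to(model_engine.device)

        loss = loss_fn(model_engine(batch), labels)
        model_engine.backward(loss)
        model_engine.step()
```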
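For multi-node runs, the launcher reads a hostfile listing each node and its GPU count (the host names below are placeholders, and passwordless SSH between the nodes is assumed):
```
worker-1 slots=8
worker-2 slots=8
```
Then launch with:
deepspeed --hostfile=hostfile my_training_script.py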
== Troubleshooting ==
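- If initialization hangs or fails with an NCCL error, rerun with NCCL debug logging to see which step fails: NCCL_DEBUG=INFO python ddp_test.py
- An "Address already in use" error usually means the `MASTER_PORT` set in the script is taken; choose another free port.
- If `torch.cuda.is_available()` returns `False`, recheck the driver and CUDA installation with `nvidia-smi` and reinstall the matching PyTorch wheel (Step 1).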
== Further Reading ==
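- PyTorch: Get Started Locally — [https://pytorch.org/get-started/locally/]
- PyTorch Distributed documentation — [https://pytorch.org/docs/stable/distributed.html]
- DeepSpeed documentation — [https://www.deepspeed.ai/]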
[[Category:AI and GPU]]
[[Category:Distributed Computing]]
[[Category:Deep Learning]]