= Multi-GPU Training Setup =
This guide outlines how to set up a Linux server for distributed training using PyTorch's Distributed Data Parallel (DDP) and DeepSpeed. This is essential for accelerating deep learning model training by leveraging multiple GPUs.
== Prerequisites ==
- A Linux server with multiple NVIDIA GPUs: Distributed data parallelism splits each training batch across GPUs, so multiple GPUs are required. You can rent GPU servers from providers such as Immers Cloud.
- NVIDIA Drivers and CUDA Toolkit: Ensure that the NVIDIA drivers and the appropriate CUDA Toolkit version are installed and configured correctly on your server. You can verify this with:
nvidia-smi
nvcc --version
- Open MPI (optional, used by some DeepSpeed features and multi-node launchers). On Debian/Ubuntu-based systems:
sudo apt update && sudo apt install libopenmpi-dev openmpi-bin openmpi-common
(Use `yum` or `dnf` for RHEL/CentOS/Fedora.)
== Step 1: Install PyTorch with Distributed Support ==
PyTorch's prebuilt binaries include distributed training support (the `torch.distributed` package and the NCCL backend); you just need to install a build that matches your CUDA version.
1. Update pip:
pip install --upgrade pip
2. Install PyTorch: Visit the official PyTorch website ([https://pytorch.org/get-started/locally/]) to find the correct installation command for your CUDA version. For example, if you have CUDA 11.8 installed, the command might look like this:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Note: Always refer to the official PyTorch website for the most up-to-date installation commands.
3. Verify installation:
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
This should output your PyTorch version and `True` if CUDA is available.
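As an extra check, the snippet below (a minimal sketch using only standard `torch` APIs) confirms that the distributed package and the NCCL backend used in the examples below are available, and reports how many GPUs PyTorch can see:
```python
import torch
import torch.distributed as dist

# Number of GPUs visible to PyTorch
print("GPU count:", torch.cuda.device_count())

# DDP with the "nccl" backend needs both of these to be True
print("torch.distributed available:", dist.is_available())
print("NCCL backend available:", dist.is_nccl_available())
```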
== Step 2: Set up a Simple DDP Example ==
Let's create a basic script to test PyTorch's Distributed Data Parallel (DDP).
1. Create a Python script (e.g., `ddp_test.py`):
nano ddp_test.py
2. Paste the following code into the file:
```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'  # Use a free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # Bind this process to its own GPU
    print(f"Rank {rank}/{world_size} initialized.")

    # Simple model
    model = nn.Linear(10, 10).to(rank)
    # Wrap the model with DDP
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    # Dummy data
    dummy_input = torch.randn(20, 10).to(rank)
    labels = torch.randn(20, 10).to(rank)

    # Forward and backward pass; DDP synchronizes gradients across ranks
    outputs = ddp_model(dummy_input)
    loss_fn = nn.MSELoss()
    loss = loss_fn(outputs, labels)
    loss.backward()

    print(f"Rank {rank} loss: {loss.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # Use all available GPUs
    print(f"Using {world_size} GPUs.")
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```
3. Save and exit nano (Ctrl+X, Y, Enter).
4. Run the script:
python ddp_test.py
You should see output indicating that each rank (GPU) has initialized and printed its loss.
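As an alternative to `mp.spawn`, you can launch the same logic with `torchrun`, which starts one process per GPU and sets `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, and the master address/port in the environment. A minimal sketch (the file name `ddp_torchrun.py` is just an example):
```python
# ddp_torchrun.py -- launch with:
#   torchrun --nproc_per_node=<num_gpus> ddp_torchrun.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist

def main():
    # Rank and world size are read from the environment set by torchrun
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).to(local_rank)
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # Same dummy forward/backward pass as in ddp_test.py
    outputs = ddp_model(torch.randn(20, 10).to(local_rank))
    loss = nn.MSELoss()(outputs, torch.randn(20, 10).to(local_rank))
    loss.backward()

    print(f"Rank {dist.get_rank()} loss: {loss.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```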
== Step 3: Install DeepSpeed ==
DeepSpeed is a deep learning optimization library that further enhances distributed training efficiency, especially for large models.
1. Install DeepSpeed:
pip install deepspeed
2. Verify DeepSpeed installation:
python -c "import deepspeed; print(deepspeed.__version__)"
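You can also run DeepSpeed's bundled environment report, which checks your compiler, CUDA version, and which optimized ops can be built on your system:
ds_report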
== Step 4: DeepSpeed Example with DDP ==
DeepSpeed replaces the DDP wrapper with its own engine, which handles gradient synchronization, the optimizer step, and mixed precision. Here is the previous example adapted for the `deepspeed` launcher.
1. Create a Python script (e.g., `deepspeed_test.py`):
nano deepspeed_test.py
2. Paste the following code into the file:
```python
import torch
import torch.nn as nn
import deepspeed

def main():
    # The deepspeed launcher starts one process per GPU and sets
    # RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    model = nn.Linear(10, 10)

    # DeepSpeed configuration (minimal example)
    ds_config = {
        # Per-GPU batch size; the total batch is this times the world size
        "train_micro_batch_size_per_gpu": 20,
        "optimizer": {
            "type": "Adam",
            "params": {"lr": 0.001}
        },
        "fp16": {"enabled": True}
    }

    # deepspeed.initialize sets up the NCCL process group, moves the
    # model to the local GPU, and wraps it in a DeepSpeed engine
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config
    )
    rank = model_engine.global_rank
    device = model_engine.device
    print(f"Rank {rank}/{model_engine.world_size} initialized.")

    # Dummy data; cast to half precision because fp16 is enabled
    dummy_input = torch.randn(20, 10, device=device).half()
    labels = torch.randn(20, 10, device=device).half()

    # Forward and backward pass
    outputs = model_engine(dummy_input)
    loss = nn.MSELoss()(outputs, labels)
    model_engine.backward(loss)  # handles fp16 loss scaling
    model_engine.step()          # optimizer step + zero_grad

    print(f"Rank {rank} loss: {loss.item()}")

if __name__ == "__main__":
    main()
```
3. Save and exit nano (Ctrl+X, Y, Enter).
4. Run the script using the DeepSpeed launcher:
deepspeed deepspeed_test.py
The launcher automatically starts one process per visible GPU (use `--num_gpus` to restrict it) and sets the rank and world-size environment variables that `deepspeed.initialize` reads.
== Step 5: Real-World Training with DeepSpeed ==
For actual model training, you'll need a more sophisticated DeepSpeed configuration file and a training script.
1. Create a DeepSpeed configuration file (e.g., `ds_config.json`):
nano ds_config.json
2. Paste a sample configuration. This is a basic example; refer to the DeepSpeed documentation for advanced options. Note that the `"auto"` values are only resolved by integrations that fill them in (such as the Hugging Face Transformers `Trainer`); if you call `deepspeed.initialize` directly, replace them with explicit values:
```json
{
  "fp16": { "enabled": true },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_optimization": { "stage": 1 },
  "wall_clock_breakdown": false,
  "flops_profiler": {
    "enabled": false,
    "profile_step": 10
  }
}
```
3. Modify your training script (e.g., `my_training_script.py`) to load this configuration; a data-sharding sketch for the training loop follows this list:
```python
import deepspeed
import torch

# ... your model definition, dataset loading, etc. ...

model = YourModel(...)
# Define an optimizer/scheduler yourself and pass them to
# deepspeed.initialize, or omit them and let DeepSpeed build them
# from the "optimizer" and "scheduler" sections of the config.

# Initialize DeepSpeed
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    args=your_arg_parser_object,   # if you use argparse
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"        # path to your config file
)

# ... your training loop ...
# Use model_engine for the forward pass, model_engine.backward(loss),
# and model_engine.step() instead of calling the optimizer directly.
```
4. Launch your training script:
deepspeed --num_gpus=N my_training_script.py
Replace `N` with the number of GPUs you want to use. (If you prefer to pass the config file on the command line instead of hard-coding its path, add DeepSpeed's standard arguments to your parser with `deepspeed.add_config_arguments(parser)` and launch with `--deepspeed --deepspeed_config ds_config.json`.)
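Inside the training loop, each rank should see a different shard of the data. Below is a minimal sketch of what that loop might look like with `DistributedSampler`; `train_dataset`, `micro_batch_size`, `num_epochs`, and `loss_fn` are placeholders for your own objects:
```python
from torch.utils.data import DataLoader, DistributedSampler

# Shard the dataset across ranks so each GPU sees different samples
sampler = DistributedSampler(train_dataset)
loader = DataLoader(train_dataset, batch_size=micro_batch_size, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for batch, labels in loader:
        batch = batch.to(model_engine.device)
        labels = labels.to(model_engine.device)

        loss = loss_fn(model_engine(batch), labels)
        model_engine.backward(loss)
        model_engine.step()
```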
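For multi-node runs, the launcher reads a hostfile listing each node and its GPU count (the host names below are placeholders, and passwordless SSH between the nodes is assumed):
```
worker-1 slots=8
worker-2 slots=8
```
Then launch with:
deepspeed --hostfile=hostfile my_training_script.py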
== Troubleshooting ==
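- If initialization hangs or fails with an NCCL error, rerun with NCCL debug logging to see which step fails: NCCL_DEBUG=INFO python ddp_test.py
- An "Address already in use" error usually means the `MASTER_PORT` set in the script is taken; choose another free port.
- If `torch.cuda.is_available()` returns `False`, recheck the driver and CUDA installation with `nvidia-smi` and reinstall the matching PyTorch wheel (Step 1).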
== Further Reading ==
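- PyTorch: Get Started Locally — [https://pytorch.org/get-started/locally/]
- PyTorch Distributed documentation — [https://pytorch.org/docs/stable/distributed.html]
- DeepSpeed documentation — [https://www.deepspeed.ai/]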
[[Category:AI and GPU]]
[[Category:Distributed Computing]]
[[Category:Deep Learning]]