Multi-GPU Training Setup
This guide outlines how to set up a Linux server for distributed training using PyTorch's Distributed Data Parallel (DDP) and DeepSpeed. This is essential for accelerating deep learning model training by leveraging multiple GPUs.
Prerequisites
- A Linux server with multiple NVIDIA GPUs: For efficient distributed training, having multiple GPUs is crucial. You can rent powerful GPU servers from providers like Immers Cloud, which offers competitive pricing.
- NVIDIA Drivers and CUDA Toolkit: Ensure that the NVIDIA drivers and the appropriate CUDA Toolkit version are installed and configured correctly on your server. You can verify this with:
nvidia-smi
nvcc --version
- Python 3 and pip: A working Python 3 installation is required.
- SSH access: You'll need SSH access to your server with a user that has sufficient privileges to install software.
- Basic Linux command-line knowledge: Familiarity with navigating the file system, installing packages, and running commands.
- (Optional) MPI (Message Passing Interface): While not strictly required for PyTorch DDP itself, MPI can be useful for certain advanced distributed computing scenarios and is often a dependency for other distributed training frameworks. You can install it using:
sudo apt update && sudo apt install libopenmpi-dev openmpi-bin openmpi-common
(For Debian/Ubuntu-based systems. Use `yum` or `dnf` for RHEL/CentOS/Fedora.)
Step 1: Install PyTorch with Distributed Support
PyTorch needs to be installed with support for distributed training. This is typically handled by installing the correct version of PyTorch that includes CUDA support.
1. Update pip:
pip install --upgrade pip
2. Install PyTorch: Visit the official PyTorch website (https://pytorch.org/get-started/locally/) to find the correct installation command for your CUDA version. For example, if you have CUDA 11.8 installed, the command might look like this:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Note: Always refer to the official PyTorch website for the most up-to-date installation commands.
3. Verify installation:
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
This should output your PyTorch version and `True` if CUDA is available.
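Before moving on to DDP, it can also help to confirm how many GPUs PyTorch sees and whether the NCCL backend (used for GPU-to-GPU communication) is available. A quick check:

```python
import torch
import torch.distributed as dist

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("GPU count:      ", torch.cuda.device_count())
print("NCCL available: ", dist.is_nccl_available())
```

On a correctly configured multi-GPU server, the GPU count should match what `nvidia-smi` reports, and NCCL should be available.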
Step 2: Set up a Simple DDP Example
Let's create a basic script to test PyTorch's Distributed Data Parallel (DDP).
1. Create a Python script (e.g., `ddp_test.py`):
nano ddp_test.py
2. Paste the following code into the file:
```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'  # use a free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    print(f"Rank {rank}/{world_size} initialized.")

    # Simple model, placed on this rank's GPU
    model = nn.Linear(10, 10).to(rank)

    # Wrap the model with DDP
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    # Dummy data
    dummy_input = torch.randn(20, 10).to(rank)
    labels = torch.randn(20, 10).to(rank)

    # Forward and backward pass
    outputs = ddp_model(dummy_input)
    loss_fn = nn.MSELoss()
    loss = loss_fn(outputs, labels)
    loss.backward()
    print(f"Rank {rank} loss: {loss.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # use all available GPUs
    print(f"Using {world_size} GPUs.")
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```
3. Save and exit nano (Ctrl+X, Y, Enter).
4. Run the script:
python ddp_test.py
You should see output indicating that each rank (GPU) has initialized and printed its loss.
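The test script feeds every rank an identical dummy batch; in real DDP training each rank should instead see its own shard of the dataset, which `torch.utils.data.DistributedSampler` handles for you. This pure-Python sketch (the function name is mine) shows the round-robin split the sampler performs:

```python
import math

def shard_indices(num_samples, world_size, rank):
    # Pad the index list by wrapping around so it divides evenly
    # (DistributedSampler does the same), then take every
    # world_size-th index starting at this rank.
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]
    return indices[rank:total:world_size]

# 10 samples across 4 ranks: every rank gets 3, with 2 wrapped repeats
for r in range(4):
    print(r, shard_indices(10, 4, r))
# → 0 [0, 4, 8] / 1 [1, 5, 9] / 2 [2, 6, 0] / 3 [3, 7, 1]
```

Because every rank processes a different shard in parallel, DDP's gradient averaging sees the whole dataset each epoch.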
Step 3: Installing DeepSpeed
DeepSpeed is a deep learning optimization library that further enhances distributed training efficiency, especially for large models.
1. Install DeepSpeed:
pip install deepspeed
2. Verify DeepSpeed installation:
python -c "import deepspeed; print(deepspeed.__version__)"
Step 4: DeepSpeed Example with DDP
DeepSpeed integrates seamlessly with PyTorch DDP. Here's a modified example.
1. Create a Python script (e.g., `deepspeed_test.py`):
nano deepspeed_test.py
2. Paste the following code into the file:
```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import deepspeed

def main():
    # The deepspeed launcher starts one process per GPU and sets
    # LOCAL_RANK for each; deepspeed.initialize() sets up the NCCL
    # process group, so no mp.spawn or init_process_group is needed.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Simple model
    model = nn.Linear(10, 10)

    # DeepSpeed configuration (minimal example)
    ds_config = {
        "train_micro_batch_size_per_gpu": 20,
        "optimizer": {"type": "Adam", "params": {"lr": 0.001}},
        "fp16": {"enabled": True},
    }

    # Initialize DeepSpeed: moves the model to the GPU, wraps it for
    # data parallelism, and builds the optimizer from the config
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    rank, world_size = dist.get_rank(), dist.get_world_size()
    print(f"Rank {rank}/{world_size} initialized.")

    # Dummy data, cast to the engine's dtype (fp16 here)
    dtype = next(model_engine.parameters()).dtype
    dummy_input = torch.randn(20, 10, device=local_rank, dtype=dtype)
    labels = torch.randn(20, 10, device=local_rank, dtype=dtype)

    # Forward and backward pass; the engine handles loss scaling,
    # gradient averaging, the optimizer step, and zeroing gradients
    outputs = model_engine(dummy_input)
    loss = nn.MSELoss()(outputs, labels)
    model_engine.backward(loss)
    model_engine.step()
    print(f"Rank {rank} loss: {loss.item()}")

if __name__ == "__main__":
    main()
```
3. Save and exit nano (Ctrl+X, Y, Enter).
4. Run the script using the DeepSpeed launcher:
deepspeed deepspeed_test.py
This command launches one process per visible GPU and sets the rank and world-size environment variables for each.
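The same launcher scales to multiple machines through a hostfile that lists each node and its GPU "slots". The hostnames below are placeholders for your own nodes, and passwordless SSH between all nodes is assumed:

```shell
# Create a hostfile: one line per node, with its GPU slot count
cat > hostfile <<'EOF'
worker-1 slots=4
worker-2 slots=4
EOF

# Launch across both nodes (8 processes total)
deepspeed --hostfile=hostfile deepspeed_test.py
```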
Step 5: Real-World Training with DeepSpeed
For actual model training, you'll need a more sophisticated DeepSpeed configuration file and a training script.
1. Create a DeepSpeed configuration file (e.g., `ds_config.json`):
nano ds_config.json
2. Paste a sample configuration. Note that the `"auto"` placeholder values found in many online examples are resolved only by the Hugging Face Transformers integration; with plain `deepspeed.initialize` you must supply concrete values. This example assumes 4 GPUs, since DeepSpeed requires `train_batch_size` = `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × number of GPUs. Refer to the DeepSpeed documentation for advanced options:
```json
{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 16,
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 3e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 3e-5,
      "warmup_num_steps": 100
    }
  },
  "zero_optimization": {
    "stage": 1
  },
  "wall_clock_breakdown": false,
  "flops_profiler": {
    "enabled": false,
    "profile_step": 10
  }
}
```
3. Modify your training script (e.g., `my_training_script.py`) to load this configuration:
```python
import deepspeed
import torch

# ... your model definition, dataset loading, etc. ...

model = YourModel(...)

# Initialize DeepSpeed; the optimizer and LR scheduler declared in
# ds_config.json are built by the engine.
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    args=your_arg_parser_object,  # optional, if you use argparse
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"  # path to your config file
)

# ... your training loop ...
# Use model_engine for the forward pass, model_engine.backward(loss),
# and model_engine.step() instead of calling the raw optimizer.
```
4. Launch your training script:
deepspeed --num_gpus=N my_training_script.py
Replace `N` with the number of GPUs you want to use. Since the script passes the config path directly to `deepspeed.initialize`, no extra command-line flags are needed.
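DeepSpeed aborts at startup if the batch-size fields in `ds_config.json` are inconsistent, so it is worth checking them by hand first. The invariant, as a small Python helper (the function name is mine):

```python
def check_batch_config(train_batch_size, micro_batch_per_gpu,
                       grad_accum_steps, num_gpus):
    """DeepSpeed enforces:
    train_batch_size == micro_batch_per_gpu * grad_accum_steps * num_gpus
    """
    effective = micro_batch_per_gpu * grad_accum_steps * num_gpus
    if effective != train_batch_size:
        raise ValueError(
            f"train_batch_size={train_batch_size} but "
            f"{micro_batch_per_gpu} x {grad_accum_steps} x {num_gpus}"
            f" = {effective}"
        )
    return effective

# 16 samples per GPU, no accumulation, 4 GPUs -> global batch of 64
print(check_batch_config(64, 16, 1, 4))
# → 64
```

Increasing `grad_accum_steps` raises the effective global batch without raising per-GPU memory, which is the usual lever when a desired batch size does not fit.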
Troubleshooting
- `torch.distributed.DistBackendError` (NCCL error):
* Cause: Often due to network issues between nodes (if distributed across multiple machines), incorrect CUDA/NCCL versions, or insufficient GPU memory.
* Solution:
* Ensure all nodes can communicate with each other on the specified ports.
* Verify that your PyTorch and CUDA versions are compatible with your NVIDIA drivers and NCCL.
* Reduce batch size or enable DeepSpeed's memory optimization features (like ZeRO).
- `RuntimeError: CUDA error: out of memory`:
* Cause: Your model or batch size is too large for the GPU memory.
* Solution:
* Reduce the `train_batch_size` or `train_micro_batch_size_per_gpu` in your DeepSpeed config.
* Use `gradient_accumulation_steps` to simulate larger batch sizes.
* Enable `fp16` training in your DeepSpeed config.
* Consider using DeepSpeed's ZeRO optimization stages (stage 2 or 3) for more aggressive memory savings.
- Process hangs or deadlocks:
* Cause: Incorrect initialization of the process group, or issues with `torch.multiprocessing`.
* Solution:
* Double-check that `MASTER_ADDR` and `MASTER_PORT` are correctly set and accessible by all processes.
* Ensure that `dist.init_process_group` is called by all processes.
* For multi-node setups, ensure `RANK` and `WORLD_SIZE` environment variables are correctly set on each node.
- Performance is lower than expected:
* Cause: Bottlenecks in data loading, inefficient model architecture, or suboptimal DeepSpeed configuration.
* Solution:
* Profile your data loading pipeline.
* Experiment with different DeepSpeed configuration parameters (e.g., `zero_optimization`, `offload`).
* Ensure your `train_micro_batch_size_per_gpu` is large enough to keep GPUs utilized.
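Several of the fixes above come down to memory arithmetic. As a rough rule, plain fp32 training with Adam needs about 16 bytes per parameter before activations, and ZeRO shards parts of that across GPUs. A back-of-the-envelope sketch (the byte counts are approximations that ignore activations and fragmentation):

```python
def adam_fp32_training_bytes(num_params):
    """Per-GPU memory for plain fp32 training with Adam, EXCLUDING
    activations: 4 B weights + 4 B grads + 8 B optimizer states
    (momentum and variance) per parameter = ~16 B/param."""
    return num_params * (4 + 4 + 8)

def with_zero_stage(num_params, world_size, stage):
    """How ZeRO shards that footprint: stage 1 shards optimizer
    states, stage 2 also gradients, stage 3 also the weights."""
    weight, grad, opt = 4 * num_params, 4 * num_params, 8 * num_params
    if stage >= 1:
        opt /= world_size
    if stage >= 2:
        grad /= world_size
    if stage >= 3:
        weight /= world_size
    return weight + grad + opt

# A 1-billion-parameter model:
print(f"single GPU: {adam_fp32_training_bytes(1_000_000_000)/1e9:.0f} GB")
# → single GPU: 16 GB
print(f"ZeRO-1 over 8 GPUs: {with_zero_stage(1_000_000_000, 8, 1)/1e9:.0f} GB")
# → ZeRO-1 over 8 GPUs: 9 GB
```

This is why bumping `zero_optimization.stage` is often the quickest fix for out-of-memory errors before resorting to smaller batches.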
Further Reading
- [PyTorch Distributed Overview](https://pytorch.org/tutorials/beginner/dist_overview.html)
- [NVIDIA CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)
- [DeepSpeed Documentation](https://www.deepspeed.ai/docs/)
- [PyTorch DistributedDataParallel Documentation](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)