<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://serverrental.store/index.php?action=history&amp;feed=atom&amp;title=Multi-GPU_Training_Setup</id>
	<title>Multi-GPU Training Setup - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://serverrental.store/index.php?action=history&amp;feed=atom&amp;title=Multi-GPU_Training_Setup"/>
	<link rel="alternate" type="text/html" href="https://serverrental.store/index.php?title=Multi-GPU_Training_Setup&amp;action=history"/>
	<updated>2026-04-14T23:04:52Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.36.1</generator>
	<entry>
		<id>https://serverrental.store/index.php?title=Multi-GPU_Training_Setup&amp;diff=5788&amp;oldid=prev</id>
		<title>Admin: New server guide</title>
		<link rel="alternate" type="text/html" href="https://serverrental.store/index.php?title=Multi-GPU_Training_Setup&amp;diff=5788&amp;oldid=prev"/>
		<updated>2026-04-13T10:00:11Z</updated>

		<summary type="html">&lt;p&gt;New server guide&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;= Multi-GPU Training Setup =&lt;br /&gt;
&lt;br /&gt;
This guide outlines how to set up a Linux server for distributed training using PyTorch's Distributed Data Parallel (DDP) and DeepSpeed. This is essential for accelerating deep learning model training by leveraging multiple GPUs.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
* '''A Linux server with multiple NVIDIA GPUs:''' For efficient distributed training, having multiple GPUs is crucial. You can rent powerful GPU servers from providers like [https://en.immers.cloud/signup/r/20241007-8310688-334/ Immers Cloud], which offers competitive pricing.&lt;br /&gt;
* '''NVIDIA Drivers and CUDA Toolkit:''' Ensure that the NVIDIA drivers and the appropriate CUDA Toolkit version are installed and configured correctly on your server. You can verify this with:&lt;br /&gt;
&amp;lt;pre&amp;gt;nvidia-smi&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;nvcc --version&amp;lt;/pre&amp;gt;&lt;br /&gt;
* '''Python 3 and pip:''' A working Python 3 installation is required.&lt;br /&gt;
* '''SSH access:''' You'll need SSH access to your server with a user that has sufficient privileges to install software.&lt;br /&gt;
* '''Basic Linux command-line knowledge:''' Familiarity with navigating the file system, installing packages, and running commands.&lt;br /&gt;
* '''(Optional) MPI (Message Passing Interface):''' While not strictly required for PyTorch DDP itself, MPI can be useful for certain advanced distributed computing scenarios and is often a dependency for other distributed training frameworks. You can install it using:&lt;br /&gt;
&amp;lt;pre&amp;gt;sudo apt update &amp;amp;&amp;amp; sudo apt install libopenmpi-dev openmpi-bin openmpi-common&amp;lt;/pre&amp;gt;&lt;br /&gt;
(For Debian/Ubuntu-based systems. Use `yum` or `dnf` for RHEL/CentOS/Fedora.)&lt;br /&gt;
&lt;br /&gt;
== Step 1: Install PyTorch with Distributed Support ==&lt;br /&gt;
&lt;br /&gt;
PyTorch needs to be installed with support for distributed training. This is typically handled by installing the correct version of PyTorch that includes CUDA support.&lt;br /&gt;
&lt;br /&gt;
1. '''Update pip:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;pip install --upgrade pip&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2. '''Install PyTorch:''' Visit the official PyTorch website ([https://pytorch.org/get-started/locally/]) to find the correct installation command for your CUDA version. For example, if you have CUDA 11.8 installed, the command might look like this:&lt;br /&gt;
   &amp;lt;pre&amp;gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118&amp;lt;/pre&amp;gt;&lt;br /&gt;
   '''Note:''' Always refer to the official PyTorch website for the most up-to-date installation commands.&lt;br /&gt;
&lt;br /&gt;
3. '''Verify installation:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;python -c &amp;quot;import torch; print(torch.__version__); print(torch.cuda.is_available())&amp;quot;&amp;lt;/pre&amp;gt;&lt;br /&gt;
   This should output your PyTorch version and `True` if CUDA is available.&lt;br /&gt;
&lt;br /&gt;
== Step 2: Set up a Simple DDP Example ==&lt;br /&gt;
&lt;br /&gt;
Let's create a basic script to test PyTorch's Distributed Data Parallel (DDP).&lt;br /&gt;
&lt;br /&gt;
1. '''Create a Python script (e.g., `ddp_test.py`):'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;nano ddp_test.py&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2. '''Paste the following code into the file:'''&lt;br /&gt;
   ```python&lt;br /&gt;
   import os&lt;br /&gt;
   import torch&lt;br /&gt;
   import torch.nn as nn&lt;br /&gt;
   import torch.distributed as dist&lt;br /&gt;
   import torch.multiprocessing as mp&lt;br /&gt;
&lt;br /&gt;
   def run(rank, world_size):&lt;br /&gt;
       os.environ['MASTER_ADDR'] = 'localhost'&lt;br /&gt;
       os.environ['MASTER_PORT'] = '12355' # Use a free port&lt;br /&gt;
&lt;br /&gt;
       torch.cuda.set_device(rank) # Bind this process to its own GPU&lt;br /&gt;
       dist.init_process_group(&amp;quot;nccl&amp;quot;, rank=rank, world_size=world_size)&lt;br /&gt;
       print(f&amp;quot;Rank {rank}/{world_size} initialized.&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
       # Simple model&lt;br /&gt;
       model = nn.Linear(10, 10).to(rank)&lt;br /&gt;
       # Wrap the model with DDP&lt;br /&gt;
       ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])&lt;br /&gt;
&lt;br /&gt;
       # Dummy data&lt;br /&gt;
       dummy_input = torch.randn(20, 10).to(rank)&lt;br /&gt;
       labels = torch.randn(20, 10).to(rank)&lt;br /&gt;
&lt;br /&gt;
       # Forward and backward pass&lt;br /&gt;
       outputs = ddp_model(dummy_input)&lt;br /&gt;
       loss_fn = nn.MSELoss()&lt;br /&gt;
       loss = loss_fn(outputs, labels)&lt;br /&gt;
       loss.backward()&lt;br /&gt;
&lt;br /&gt;
       print(f&amp;quot;Rank {rank} loss: {loss.item()}&amp;quot;)&lt;br /&gt;
       dist.destroy_process_group()&lt;br /&gt;
&lt;br /&gt;
   if __name__ == &amp;quot;__main__&amp;quot;:&lt;br /&gt;
       world_size = torch.cuda.device_count() # Use all available GPUs&lt;br /&gt;
       print(f&amp;quot;Using {world_size} GPUs.&amp;quot;)&lt;br /&gt;
       mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)&lt;br /&gt;
   ```&lt;br /&gt;
&lt;br /&gt;
3. '''Save and exit nano''' (Ctrl+X, Y, Enter).&lt;br /&gt;
&lt;br /&gt;
4. '''Run the script:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;python ddp_test.py&amp;lt;/pre&amp;gt;&lt;br /&gt;
   You should see output indicating that each rank (GPU) has initialized and printed its loss.&lt;br /&gt;
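&lt;br /&gt;
As an alternative to `mp.spawn`, the script can be started with `torchrun`, PyTorch's standard launcher, which creates one process per GPU and exports the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below (plain Python, no GPU needed; the `rank_info` helper is illustrative, not part of PyTorch) shows how a script typically reads them:&lt;br /&gt;
   ```python&lt;br /&gt;
   import os&lt;br /&gt;
&lt;br /&gt;
   def rank_info(env):&lt;br /&gt;
       # Parse the per-process variables a launcher such as torchrun exports.&lt;br /&gt;
       return {&lt;br /&gt;
           'rank': int(env.get('RANK', 0)),&lt;br /&gt;
           'local_rank': int(env.get('LOCAL_RANK', 0)),&lt;br /&gt;
           'world_size': int(env.get('WORLD_SIZE', 1)),&lt;br /&gt;
       }&lt;br /&gt;
&lt;br /&gt;
   # Simulated environment for the second of two processes on one node&lt;br /&gt;
   print(rank_info({'RANK': '1', 'LOCAL_RANK': '1', 'WORLD_SIZE': '2'}))&lt;br /&gt;
   ```&lt;br /&gt;
   With `torchrun --standalone --nproc_per_node=N ddp_test.py`, `dist.init_process_group(&amp;quot;nccl&amp;quot;)` can be called without explicit `rank` and `world_size` arguments, because the launcher supplies them through the environment.&lt;br /&gt;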
&lt;br /&gt;
== Step 3: Installing DeepSpeed ==&lt;br /&gt;
&lt;br /&gt;
DeepSpeed is a deep learning optimization library that further enhances distributed training efficiency, especially for large models.&lt;br /&gt;
&lt;br /&gt;
1. '''Install DeepSpeed:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;pip install deepspeed&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2. '''Verify DeepSpeed installation:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;python -c &amp;quot;import deepspeed; print(deepspeed.__version__)&amp;quot;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Step 4: DeepSpeed Example with DDP ==&lt;br /&gt;
&lt;br /&gt;
DeepSpeed integrates seamlessly with PyTorch DDP. Here's a modified example.&lt;br /&gt;
&lt;br /&gt;
1. '''Create a Python script (e.g., `deepspeed_test.py`):'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;nano deepspeed_test.py&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2. '''Paste the following code into the file:'''&lt;br /&gt;
   ```python&lt;br /&gt;
   import os&lt;br /&gt;
   import torch&lt;br /&gt;
   import torch.nn as nn&lt;br /&gt;
   import deepspeed&lt;br /&gt;
&lt;br /&gt;
   def main():&lt;br /&gt;
       # The deepspeed launcher starts one process per GPU and sets LOCAL_RANK.&lt;br /&gt;
       local_rank = int(os.environ.get('LOCAL_RANK', 0))&lt;br /&gt;
       torch.cuda.set_device(local_rank)&lt;br /&gt;
       deepspeed.init_distributed()  # sets up the NCCL process group&lt;br /&gt;
&lt;br /&gt;
       # Simple model&lt;br /&gt;
       model = nn.Linear(10, 10)&lt;br /&gt;
&lt;br /&gt;
       # DeepSpeed configuration (minimal example)&lt;br /&gt;
       ds_config = {&lt;br /&gt;
           'train_micro_batch_size_per_gpu': 20,&lt;br /&gt;
           'optimizer': {&lt;br /&gt;
               'type': 'Adam',&lt;br /&gt;
               'params': {&lt;br /&gt;
                   'lr': 0.001&lt;br /&gt;
               }&lt;br /&gt;
           },&lt;br /&gt;
           'fp16': {&lt;br /&gt;
               'enabled': True&lt;br /&gt;
           }&lt;br /&gt;
       }&lt;br /&gt;
&lt;br /&gt;
       # Initialize DeepSpeed; the engine owns the optimizer and fp16 scaling&lt;br /&gt;
       model_engine, optimizer, _, _ = deepspeed.initialize(&lt;br /&gt;
           model=model,&lt;br /&gt;
           model_parameters=model.parameters(),&lt;br /&gt;
           config=ds_config&lt;br /&gt;
       )&lt;br /&gt;
&lt;br /&gt;
       # Dummy data, cast to half precision to match the fp16 engine&lt;br /&gt;
       dummy_input = torch.randn(20, 10).to(model_engine.device).half()&lt;br /&gt;
       labels = torch.randn(20, 10).to(model_engine.device).half()&lt;br /&gt;
&lt;br /&gt;
       # Forward and backward pass&lt;br /&gt;
       outputs = model_engine(dummy_input)&lt;br /&gt;
       loss = nn.MSELoss()(outputs, labels)&lt;br /&gt;
       model_engine.backward(loss)  # handles fp16 loss scaling&lt;br /&gt;
       model_engine.step()          # optimizer step plus gradient zeroing&lt;br /&gt;
&lt;br /&gt;
       rank = torch.distributed.get_rank()&lt;br /&gt;
       print(f'Rank {rank}/{torch.distributed.get_world_size()} loss: {loss.item()}')&lt;br /&gt;
&lt;br /&gt;
   if __name__ == '__main__':&lt;br /&gt;
       main()&lt;br /&gt;
   ```&lt;br /&gt;
&lt;br /&gt;
3. '''Save and exit nano''' (Ctrl+X, Y, Enter).&lt;br /&gt;
&lt;br /&gt;
4. '''Run the script using the DeepSpeed launcher:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;deepspeed deepspeed_test.py&amp;lt;/pre&amp;gt;&lt;br /&gt;
   The launcher starts one process per visible GPU and sets the `LOCAL_RANK` environment variable for each, so the script itself does not need `mp.spawn`.&lt;br /&gt;
&lt;br /&gt;
== Step 5: Real-World Training with DeepSpeed ==&lt;br /&gt;
&lt;br /&gt;
For actual model training, you'll need a more sophisticated DeepSpeed configuration file and a training script.&lt;br /&gt;
&lt;br /&gt;
1. '''Create a DeepSpeed configuration file (e.g., `ds_config.json`):'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;nano ds_config.json&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2. '''Paste a sample configuration. This is a basic example; refer to the DeepSpeed documentation for advanced options.''' Note that the &amp;quot;auto&amp;quot; placeholder values below are resolved only by integrations such as Hugging Face Transformers; when calling `deepspeed.initialize` directly, replace each one with a concrete value:&lt;br /&gt;
   ```json&lt;br /&gt;
   {&lt;br /&gt;
     &amp;quot;fp16&amp;quot;: {&lt;br /&gt;
       &amp;quot;enabled&amp;quot;: true&lt;br /&gt;
     },&lt;br /&gt;
     &amp;quot;optimizer&amp;quot;: {&lt;br /&gt;
       &amp;quot;type&amp;quot;: &amp;quot;AdamW&amp;quot;,&lt;br /&gt;
       &amp;quot;params&amp;quot;: {&lt;br /&gt;
         &amp;quot;lr&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
         &amp;quot;betas&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
         &amp;quot;eps&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
         &amp;quot;weight_decay&amp;quot;: &amp;quot;auto&amp;quot;&lt;br /&gt;
       }&lt;br /&gt;
     },&lt;br /&gt;
     &amp;quot;scheduler&amp;quot;: {&lt;br /&gt;
       &amp;quot;type&amp;quot;: &amp;quot;WarmupLR&amp;quot;,&lt;br /&gt;
       &amp;quot;params&amp;quot;: {&lt;br /&gt;
         &amp;quot;warmup_min_lr&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
         &amp;quot;warmup_max_lr&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
         &amp;quot;warmup_num_steps&amp;quot;: &amp;quot;auto&amp;quot;&lt;br /&gt;
       }&lt;br /&gt;
     },&lt;br /&gt;
     &amp;quot;train_batch_size&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
     &amp;quot;train_micro_batch_size_per_gpu&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
     &amp;quot;gradient_accumulation_steps&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
     &amp;quot;gradient_clipping&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
     &amp;quot;zero_optimization&amp;quot;: {&lt;br /&gt;
       &amp;quot;stage&amp;quot;: 1&lt;br /&gt;
     },&lt;br /&gt;
     &amp;quot;wall_clock_breakdown&amp;quot;: false,&lt;br /&gt;
     &amp;quot;flops_profiler&amp;quot;: {&lt;br /&gt;
       &amp;quot;enabled&amp;quot;: false,&lt;br /&gt;
       &amp;quot;profile_step&amp;quot;: 10&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
   ```&lt;br /&gt;
&lt;br /&gt;
3. '''Modify your training script (e.g., `my_training_script.py`) to load this configuration:'''&lt;br /&gt;
   ```python&lt;br /&gt;
   import deepspeed&lt;br /&gt;
   import torch&lt;br /&gt;
&lt;br /&gt;
   # ... your model definition, dataset loading, etc. ...&lt;br /&gt;
&lt;br /&gt;
   model = YourModel(...)&lt;br /&gt;
   # If ds_config.json defines the optimizer and scheduler (as above),&lt;br /&gt;
   # let DeepSpeed build them from the config instead of constructing your own.&lt;br /&gt;
&lt;br /&gt;
   # Initialize DeepSpeed&lt;br /&gt;
   model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(&lt;br /&gt;
       args=your_arg_parser_object, # If you use argparse&lt;br /&gt;
       model=model,&lt;br /&gt;
       model_parameters=model.parameters(),&lt;br /&gt;
       config='ds_config.json' # Path to your config file&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
   # ... your training loop ...&lt;br /&gt;
   # Use model_engine for forward, backward, and optimizer steps&lt;br /&gt;
   ```&lt;br /&gt;
&lt;br /&gt;
4. '''Launch your training script:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;deepspeed --num_gpus=N my_training_script.py&amp;lt;/pre&amp;gt;&lt;br /&gt;
   Replace `N` with the number of GPUs you want to use. The config path is already read inside the script, so no extra flag is needed.&lt;br /&gt;
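&lt;br /&gt;
DeepSpeed checks at startup that the batch-size fields in the config are mutually consistent: `train_batch_size` must equal `train_micro_batch_size_per_gpu` multiplied by `gradient_accumulation_steps` and by the number of GPUs. The arithmetic can be sketched as follows (the helper name is illustrative):&lt;br /&gt;
   ```python&lt;br /&gt;
   def effective_train_batch_size(micro_batch_per_gpu, grad_accum_steps, num_gpus):&lt;br /&gt;
       # train_batch_size ==&lt;br /&gt;
       #   train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus&lt;br /&gt;
       return micro_batch_per_gpu * grad_accum_steps * num_gpus&lt;br /&gt;
&lt;br /&gt;
   # Micro batch 4, accumulation 8, 2 GPUs&lt;br /&gt;
   print(effective_train_batch_size(4, 8, 2)) # prints 64&lt;br /&gt;
   ```&lt;br /&gt;
   If the values are inconsistent, DeepSpeed aborts with a batch-size mismatch error, so it is worth doing this multiplication before launching a long job.&lt;br /&gt;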
&lt;br /&gt;
== Troubleshooting ==&lt;br /&gt;
&lt;br /&gt;
* '''`torch.distributed.DistBackendError: NCCL error`''':&lt;br /&gt;
    * '''Cause:''' Often due to network issues between nodes (if distributed across multiple machines), incorrect CUDA/NCCL versions, or insufficient GPU memory.&lt;br /&gt;
    * '''Solution:'''&lt;br /&gt;
        * Ensure all nodes can communicate with each other on the specified ports.&lt;br /&gt;
        * Verify that your PyTorch and CUDA versions are compatible with your NVIDIA drivers and NCCL.&lt;br /&gt;
        * Reduce batch size or enable DeepSpeed's memory optimization features (like ZeRO).&lt;br /&gt;
* '''`RuntimeError: CUDA error: out of memory`''':&lt;br /&gt;
    * '''Cause:''' Your model or batch size is too large for the GPU memory.&lt;br /&gt;
    * '''Solution:'''&lt;br /&gt;
        * Reduce the `train_batch_size` or `train_micro_batch_size_per_gpu` in your DeepSpeed config.&lt;br /&gt;
        * Use `gradient_accumulation_steps` to simulate larger batch sizes.&lt;br /&gt;
        * Enable `fp16` training in your DeepSpeed config.&lt;br /&gt;
        * Consider using DeepSpeed's ZeRO optimization stages (stage 2 or 3) for more aggressive memory savings.&lt;br /&gt;
* '''Process hangs or deadlocks:'''&lt;br /&gt;
    * '''Cause:''' Incorrect initialization of the process group, or issues with `torch.multiprocessing`.&lt;br /&gt;
    * '''Solution:'''&lt;br /&gt;
        * Double-check that `MASTER_ADDR` and `MASTER_PORT` are correctly set and accessible by all processes.&lt;br /&gt;
        * Ensure that `dist.init_process_group` is called by all processes.&lt;br /&gt;
        * For multi-node setups, ensure `RANK` and `WORLD_SIZE` environment variables are correctly set on each node.&lt;br /&gt;
* '''Performance is lower than expected:'''&lt;br /&gt;
    * '''Cause:''' Bottlenecks in data loading, inefficient model architecture, or suboptimal DeepSpeed configuration.&lt;br /&gt;
    * '''Solution:'''&lt;br /&gt;
        * Profile your data loading pipeline.&lt;br /&gt;
        * Experiment with different DeepSpeed configuration parameters (e.g., `zero_optimization`, `offload`).&lt;br /&gt;
        * Ensure your `train_micro_batch_size_per_gpu` is large enough to keep GPUs utilized.&lt;br /&gt;
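&lt;br /&gt;
For the hang and NCCL cases above, a quick check is whether `MASTER_PORT` is actually free before launching. A small sketch using only the standard library (`port_is_free` is a hypothetical helper, not part of PyTorch):&lt;br /&gt;
```python&lt;br /&gt;
import socket&lt;br /&gt;
&lt;br /&gt;
def port_is_free(host, port):&lt;br /&gt;
    # True if nothing is bound to the port, i.e. it can serve as MASTER_PORT.&lt;br /&gt;
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:&lt;br /&gt;
        try:&lt;br /&gt;
            s.bind((host, port))&lt;br /&gt;
            return True&lt;br /&gt;
        except OSError:&lt;br /&gt;
            return False&lt;br /&gt;
&lt;br /&gt;
print(port_is_free('127.0.0.1', 12355))&lt;br /&gt;
```&lt;br /&gt;
If the port is taken, pick another free port and export the same value on every node before starting the job.&lt;br /&gt;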
&lt;br /&gt;
== Further Reading ==&lt;br /&gt;
* [[PyTorch Distributed Overview]]&lt;br /&gt;
* [[NVIDIA CUDA Installation Guide]]&lt;br /&gt;
* [https://www.deepspeed.ai/docs/ DeepSpeed Documentation]&lt;br /&gt;
* [https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html PyTorch DistributedDataParallel Documentation]&lt;br /&gt;
&lt;br /&gt;
[[Category:AI and GPU]]&lt;br /&gt;
[[Category:Distributed Computing]]&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
&lt;br /&gt;
{{Exchange Box}}&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
</feed>