Multi-GPU Training Setup
This guide outlines how to set up a Linux server for distributed training using PyTorch's Distributed Data Parallel (DDP) and DeepSpeed. This is essential for accelerating deep learning model training by leveraging multiple GPUs.
Prerequisites
- A Linux server with multiple NVIDIA GPUs: For efficient distributed training, having multiple GPUs is crucial. You can rent powerful GPU servers from providers like Immers Cloud, which offers competitive pricing.
- NVIDIA Drivers and CUDA Toolkit: Ensure that the NVIDIA drivers and the appropriate CUDA Toolkit version are installed and configured correctly on your server. You can verify this with:
nvidia-smi
nvcc --version
- Python 3 and pip: A working Python 3 installation is required.
- SSH access: You'll need SSH access to your server with a user that has sufficient privileges to install software.
- Basic Linux command-line knowledge: Familiarity with navigating the file system, installing packages, and running commands.
- (Optional) MPI (Message Passing Interface): While not strictly required for PyTorch DDP itself, MPI can be useful for certain advanced distributed computing scenarios and is often a dependency for other distributed training frameworks. You can install it using:
sudo apt update && sudo apt install libopenmpi-dev openmpi-bin openmpi-common
(For Debian/Ubuntu-based systems. Use `yum` or `dnf` for RHEL/CentOS/Fedora.)
Step 1: Install PyTorch with Distributed Support
PyTorch needs to be installed with support for distributed training. This is typically handled by installing the correct version of PyTorch that includes CUDA support.
1. Update pip:
pip install --upgrade pip
2. Install PyTorch: Visit the official PyTorch website (https://pytorch.org/get-started/locally/) to find the correct installation command for your CUDA version. For example, if you have CUDA 11.8 installed, the command might look like this:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Note: Always refer to the official PyTorch website for the most up-to-date installation commands.
3. Verify installation:
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
This should output your PyTorch version and `True` if CUDA is available.
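Before moving on to DDP, it can also help to confirm how many GPUs PyTorch sees and whether the NCCL backend (used for GPU-to-GPU communication) is available. A quick check:

```python
import torch
import torch.distributed as dist

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("GPU count:      ", torch.cuda.device_count())
print("NCCL available: ", dist.is_nccl_available())
```

On a correctly configured multi-GPU server, the GPU count should match what `nvidia-smi` reports, and NCCL should be available.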
Step 2: Set up a Simple DDP Example
Let's create a basic script to test PyTorch's Distributed Data Parallel (DDP).
1. Create a Python script (e.g., `ddp_test.py`):
nano ddp_test.py
2. Paste the following code into the file:
```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'  # use a free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    print(f"Rank {rank}/{world_size} initialized.")

    # Simple model, placed on this rank's GPU
    model = nn.Linear(10, 10).to(rank)

    # Wrap the model with DDP
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    # Dummy data
    dummy_input = torch.randn(20, 10).to(rank)
    labels = torch.randn(20, 10).to(rank)

    # Forward and backward pass
    outputs = ddp_model(dummy_input)
    loss_fn = nn.MSELoss()
    loss = loss_fn(outputs, labels)
    loss.backward()
    print(f"Rank {rank} loss: {loss.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # use all available GPUs
    print(f"Using {world_size} GPUs.")
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```
3. Save and exit nano (Ctrl+X, Y, Enter).
4. Run the script:
python ddp_test.py
You should see output indicating that each rank (GPU) has initialized and printed its loss.
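The test script feeds every rank an identical dummy batch; in real DDP training each rank should instead see its own shard of the dataset, which `torch.utils.data.DistributedSampler` handles for you. This pure-Python sketch (the function name is mine) shows the round-robin split the sampler performs:

```python
import math

def shard_indices(num_samples, world_size, rank):
    # Pad the index list by wrapping around so it divides evenly
    # (DistributedSampler does the same), then take every
    # world_size-th index starting at this rank.
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]
    return indices[rank:total:world_size]

# 10 samples across 4 ranks: every rank gets 3, with 2 wrapped repeats
for r in range(4):
    print(r, shard_indices(10, 4, r))
# → 0 [0, 4, 8] / 1 [1, 5, 9] / 2 [2, 6, 0] / 3 [3, 7, 1]
```

Because every rank processes a different shard in parallel, DDP's gradient averaging sees the whole dataset each epoch.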
Step 3: Installing DeepSpeed
DeepSpeed is a deep learning optimization library that further enhances distributed training efficiency, especially for large models.
1. Install DeepSpeed:
pip install deepspeed
2. Verify DeepSpeed installation:
python -c "import deepspeed; print(deepspeed.__version__)"
Step 4: DeepSpeed Example with DDP
DeepSpeed integrates seamlessly with PyTorch DDP. Here's a modified example.
1. Create a Python script (e.g., `deepspeed_test.py`):
nano deepspeed_test.py
2. Paste the following code into the file:
```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import deepspeed

def main():
    # The deepspeed launcher starts one process per GPU and sets
    # LOCAL_RANK for each; deepspeed.initialize() sets up the NCCL
    # process group, so no mp.spawn or init_process_group is needed.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Simple model
    model = nn.Linear(10, 10)

    # DeepSpeed configuration (minimal example)
    ds_config = {
        "train_micro_batch_size_per_gpu": 20,
        "optimizer": {"type": "Adam", "params": {"lr": 0.001}},
        "fp16": {"enabled": True},
    }

    # Initialize DeepSpeed: moves the model to the GPU, wraps it for
    # data parallelism, and builds the optimizer from the config
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    rank, world_size = dist.get_rank(), dist.get_world_size()
    print(f"Rank {rank}/{world_size} initialized.")

    # Dummy data, cast to the engine's dtype (fp16 here)
    dtype = next(model_engine.parameters()).dtype
    dummy_input = torch.randn(20, 10, device=local_rank, dtype=dtype)
    labels = torch.randn(20, 10, device=local_rank, dtype=dtype)

    # Forward and backward pass; the engine handles loss scaling,
    # gradient averaging, the optimizer step, and zeroing gradients
    outputs = model_engine(dummy_input)
    loss = nn.MSELoss()(outputs, labels)
    model_engine.backward(loss)
    model_engine.step()
    print(f"Rank {rank} loss: {loss.item()}")

if __name__ == "__main__":
    main()
```
3. Save and exit nano (Ctrl+X, Y, Enter).
4. Run the script using the DeepSpeed launcher:
deepspeed deepspeed_test.py
This command launches one process per visible GPU and sets the rank and world-size environment variables for each.
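The same launcher scales to multiple machines through a hostfile that lists each node and its GPU "slots". The hostnames below are placeholders for your own nodes, and passwordless SSH between all nodes is assumed:

```shell
# Create a hostfile: one line per node, with its GPU slot count
cat > hostfile <<'EOF'
worker-1 slots=4
worker-2 slots=4
EOF

# Launch across both nodes (8 processes total)
deepspeed --hostfile=hostfile deepspeed_test.py
```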
Step 5: Real-World Training with DeepSpeed
For actual model training, you'll need a more sophisticated DeepSpeed configuration file and a training script.
1. Create a DeepSpeed configuration file (e.g., `ds_config.json`):
nano ds_config.json
2. Paste a sample configuration. Note that the `"auto"` placeholder values found in many online examples are resolved only by the Hugging Face Transformers integration; with plain `deepspeed.initialize` you must supply concrete values. This example assumes 4 GPUs, since DeepSpeed requires `train_batch_size` = `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × number of GPUs. Refer to the DeepSpeed documentation for advanced options:
```json
{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 16,
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 3e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 3e-5,
      "warmup_num_steps": 100
    }
  },
  "zero_optimization": {
    "stage": 1
  },
  "wall_clock_breakdown": false,
  "flops_profiler": {
    "enabled": false,
    "profile_step": 10
  }
}
```
3. Modify your training script (e.g., `my_training_script.py`) to load this configuration:
```python
import deepspeed
import torch

# ... your model definition, dataset loading, etc. ...

model = YourModel(...)

# Initialize DeepSpeed; the optimizer and LR scheduler declared in
# ds_config.json are built by the engine.
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    args=your_arg_parser_object,  # optional, if you use argparse
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"  # path to your config file
)

# ... your training loop ...
# Use model_engine for the forward pass, model_engine.backward(loss),
# and model_engine.step() instead of calling the raw optimizer.
```
4. Launch your training script:
deepspeed --num_gpus=N my_training_script.py
Replace `N` with the number of GPUs you want to use. Since the script passes the config path directly to `deepspeed.initialize`, no extra command-line flags are needed.
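DeepSpeed aborts at startup if the batch-size fields in `ds_config.json` are inconsistent, so it is worth checking them by hand first. The invariant, as a small Python helper (the function name is mine):

```python
def check_batch_config(train_batch_size, micro_batch_per_gpu,
                       grad_accum_steps, num_gpus):
    """DeepSpeed enforces:
    train_batch_size == micro_batch_per_gpu * grad_accum_steps * num_gpus
    """
    effective = micro_batch_per_gpu * grad_accum_steps * num_gpus
    if effective != train_batch_size:
        raise ValueError(
            f"train_batch_size={train_batch_size} but "
            f"{micro_batch_per_gpu} x {grad_accum_steps} x {num_gpus}"
            f" = {effective}"
        )
    return effective

# 16 samples per GPU, no accumulation, 4 GPUs -> global batch of 64
print(check_batch_config(64, 16, 1, 4))
# → 64
```

Increasing `grad_accum_steps` raises the effective global batch without raising per-GPU memory, which is the usual lever when a desired batch size does not fit.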
Troubleshooting
- `torch.distributed.DistBackendError` (NCCL error):
* Cause: Often due to network issues between nodes (if distributed across multiple machines), incorrect CUDA/NCCL versions, or insufficient GPU memory.
* Solution:
* Ensure all nodes can communicate with each other on the specified ports.
* Verify that your PyTorch and CUDA versions are compatible with your NVIDIA drivers and NCCL.
* Reduce batch size or enable DeepSpeed's memory optimization features (like ZeRO).
- `RuntimeError: CUDA error: out of memory`:
* Cause: Your model or batch size is too large for the GPU memory.
* Solution:
* Reduce the `train_batch_size` or `train_micro_batch_size_per_gpu` in your DeepSpeed config.
* Use `gradient_accumulation_steps` to simulate larger batch sizes.
* Enable `fp16` training in your DeepSpeed config.
* Consider using DeepSpeed's ZeRO optimization stages (stage 2 or 3) for more aggressive memory savings.
- Process hangs or deadlocks:
* Cause: Incorrect initialization of the process group, or issues with `torch.multiprocessing`.
* Solution:
* Double-check that `MASTER_ADDR` and `MASTER_PORT` are correctly set and accessible by all processes.
* Ensure that `dist.init_process_group` is called by all processes.
* For multi-node setups, ensure `RANK` and `WORLD_SIZE` environment variables are correctly set on each node.
- Performance is lower than expected:
* Cause: Bottlenecks in data loading, inefficient model architecture, or suboptimal DeepSpeed configuration.
* Solution:
* Profile your data loading pipeline.
* Experiment with different DeepSpeed configuration parameters (e.g., `zero_optimization`, `offload`).
* Ensure your `train_micro_batch_size_per_gpu` is large enough to keep GPUs utilized.
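Several of the fixes above come down to memory arithmetic. As a rough rule, plain fp32 training with Adam needs about 16 bytes per parameter before activations, and ZeRO shards parts of that across GPUs. A back-of-the-envelope sketch (the byte counts are approximations that ignore activations and fragmentation):

```python
def adam_fp32_training_bytes(num_params):
    """Per-GPU memory for plain fp32 training with Adam, EXCLUDING
    activations: 4 B weights + 4 B grads + 8 B optimizer states
    (momentum and variance) per parameter = ~16 B/param."""
    return num_params * (4 + 4 + 8)

def with_zero_stage(num_params, world_size, stage):
    """How ZeRO shards that footprint: stage 1 shards optimizer
    states, stage 2 also gradients, stage 3 also the weights."""
    weight, grad, opt = 4 * num_params, 4 * num_params, 8 * num_params
    if stage >= 1:
        opt /= world_size
    if stage >= 2:
        grad /= world_size
    if stage >= 3:
        weight /= world_size
    return weight + grad + opt

# A 1-billion-parameter model:
print(f"single GPU: {adam_fp32_training_bytes(1_000_000_000)/1e9:.0f} GB")
# → single GPU: 16 GB
print(f"ZeRO-1 over 8 GPUs: {with_zero_stage(1_000_000_000, 8, 1)/1e9:.0f} GB")
# → ZeRO-1 over 8 GPUs: 9 GB
```

This is why bumping `zero_optimization.stage` is often the quickest fix for out-of-memory errors before resorting to smaller batches.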
Further Reading
- [PyTorch Distributed Overview](https://pytorch.org/tutorials/beginner/dist_overview.html)
- [NVIDIA CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)
- [DeepSpeed Documentation](https://www.deepspeed.ai/docs/)
- [PyTorch DistributedDataParallel Documentation](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)