<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://serverrental.store/index.php?action=history&amp;feed=atom&amp;title=Multi-GPU_Training_Setup</id>
	<title>Multi-GPU Training Setup - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://serverrental.store/index.php?action=history&amp;feed=atom&amp;title=Multi-GPU_Training_Setup"/>
	<link rel="alternate" type="text/html" href="https://serverrental.store/index.php?title=Multi-GPU_Training_Setup&amp;action=history"/>
	<updated>2026-04-14T23:04:52Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.36.1</generator>
	<entry>
		<id>https://serverrental.store/index.php?title=Multi-GPU_Training_Setup&amp;diff=5788&amp;oldid=prev</id>
		<title>Admin: New server guide</title>
		<link rel="alternate" type="text/html" href="https://serverrental.store/index.php?title=Multi-GPU_Training_Setup&amp;diff=5788&amp;oldid=prev"/>
		<updated>2026-04-13T10:00:11Z</updated>

		<summary type="html">&lt;p&gt;New server guide&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;= Multi-GPU Training Setup =&lt;br /&gt;
&lt;br /&gt;
This guide outlines how to set up a Linux server for distributed training using PyTorch's Distributed Data Parallel (DDP) and DeepSpeed. This is essential for accelerating deep learning model training by leveraging multiple GPUs.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
* '''A Linux server with multiple NVIDIA GPUs:''' For efficient distributed training, having multiple GPUs is crucial. You can rent powerful GPU servers from providers like [https://en.immers.cloud/signup/r/20241007-8310688-334/ Immers Cloud], which offers competitive pricing.&lt;br /&gt;
* '''NVIDIA Drivers and CUDA Toolkit:''' Ensure that the NVIDIA drivers and the appropriate CUDA Toolkit version are installed and configured correctly on your server. You can verify this with:&lt;br /&gt;
&amp;lt;pre&amp;gt;nvidia-smi&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;nvcc --version&amp;lt;/pre&amp;gt;&lt;br /&gt;
* '''Python 3 and pip:''' A working Python 3 installation is required.&lt;br /&gt;
* '''SSH access:''' You'll need SSH access to your server with a user that has sufficient privileges to install software.&lt;br /&gt;
* '''Basic Linux command-line knowledge:''' Familiarity with navigating the file system, installing packages, and running commands.&lt;br /&gt;
* '''(Optional) MPI (Message Passing Interface):''' While not strictly required for PyTorch DDP itself, MPI can be useful for certain advanced distributed computing scenarios and is often a dependency for other distributed training frameworks. You can install it using:&lt;br /&gt;
&amp;lt;pre&amp;gt;sudo apt update &amp;amp;&amp;amp; sudo apt install libopenmpi-dev openmpi-bin openmpi-common&amp;lt;/pre&amp;gt;&lt;br /&gt;
(For Debian/Ubuntu-based systems. Use `yum` or `dnf` for RHEL/CentOS/Fedora.)&lt;br /&gt;
&lt;br /&gt;
== Step 1: Install PyTorch with Distributed Support ==&lt;br /&gt;
&lt;br /&gt;
PyTorch needs to be installed with support for distributed training. This is typically handled by installing the correct version of PyTorch that includes CUDA support.&lt;br /&gt;
&lt;br /&gt;
1. '''Update pip:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;pip install --upgrade pip&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2. '''Install PyTorch:''' Visit the official PyTorch website ([https://pytorch.org/get-started/locally/]) to find the correct installation command for your CUDA version. For example, if you have CUDA 11.8 installed, the command might look like this:&lt;br /&gt;
   &amp;lt;pre&amp;gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118&amp;lt;/pre&amp;gt;&lt;br /&gt;
   '''Note:''' Always refer to the official PyTorch website for the most up-to-date installation commands.&lt;br /&gt;
&lt;br /&gt;
3. '''Verify installation:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;python -c &amp;quot;import torch; print(torch.__version__); print(torch.cuda.is_available())&amp;quot;&amp;lt;/pre&amp;gt;&lt;br /&gt;
   This should output your PyTorch version and `True` if CUDA is available.&lt;br /&gt;
&lt;br /&gt;
== Step 2: Set up a Simple DDP Example ==&lt;br /&gt;
&lt;br /&gt;
Let's create a basic script to test PyTorch's Distributed Data Parallel (DDP).&lt;br /&gt;
&lt;br /&gt;
1. '''Create a Python script (e.g., `ddp_test.py`):'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;nano ddp_test.py&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2. '''Paste the following code into the file:'''&lt;br /&gt;
   ```python&lt;br /&gt;
   import os&lt;br /&gt;
   import torch&lt;br /&gt;
   import torch.nn as nn&lt;br /&gt;
   import torch.distributed as dist&lt;br /&gt;
   import torch.multiprocessing as mp&lt;br /&gt;
&lt;br /&gt;
   def run(rank, world_size):&lt;br /&gt;
       os.environ['MASTER_ADDR'] = 'localhost'&lt;br /&gt;
       os.environ['MASTER_PORT'] = '12355' # Use a free port&lt;br /&gt;
&lt;br /&gt;
       torch.cuda.set_device(rank) # Bind this process to its own GPU&lt;br /&gt;
       dist.init_process_group(&amp;quot;nccl&amp;quot;, rank=rank, world_size=world_size)&lt;br /&gt;
       print(f&amp;quot;Rank {rank}/{world_size} initialized.&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
       # Simple model&lt;br /&gt;
       model = nn.Linear(10, 10).to(rank)&lt;br /&gt;
       # Wrap the model with DDP&lt;br /&gt;
       ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])&lt;br /&gt;
&lt;br /&gt;
       # Dummy data&lt;br /&gt;
       dummy_input = torch.randn(20, 10).to(rank)&lt;br /&gt;
       labels = torch.randn(20, 10).to(rank)&lt;br /&gt;
&lt;br /&gt;
       # Forward and backward pass&lt;br /&gt;
       outputs = ddp_model(dummy_input)&lt;br /&gt;
       loss_fn = nn.MSELoss()&lt;br /&gt;
       loss = loss_fn(outputs, labels)&lt;br /&gt;
       loss.backward()&lt;br /&gt;
&lt;br /&gt;
       print(f&amp;quot;Rank {rank} loss: {loss.item()}&amp;quot;)&lt;br /&gt;
       dist.destroy_process_group()&lt;br /&gt;
&lt;br /&gt;
   if __name__ == &amp;quot;__main__&amp;quot;:&lt;br /&gt;
       world_size = torch.cuda.device_count() # Use all available GPUs&lt;br /&gt;
       print(f&amp;quot;Using {world_size} GPUs.&amp;quot;)&lt;br /&gt;
       mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)&lt;br /&gt;
   ```&lt;br /&gt;
&lt;br /&gt;
3. '''Save and exit nano''' (Ctrl+X, Y, Enter).&lt;br /&gt;
&lt;br /&gt;
4. '''Run the script:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;python ddp_test.py&amp;lt;/pre&amp;gt;&lt;br /&gt;
   You should see output indicating that each rank (GPU) has initialized and printed its loss.&lt;br /&gt;
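&lt;br /&gt;
As an alternative to `mp.spawn`, the script can be started with `torchrun`, PyTorch's standard launcher, which creates one process per GPU and exports the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below (plain Python, no GPU needed; the `rank_info` helper is illustrative, not part of PyTorch) shows how a script typically reads them:&lt;br /&gt;
   ```python&lt;br /&gt;
   import os&lt;br /&gt;
&lt;br /&gt;
   def rank_info(env):&lt;br /&gt;
       # Parse the per-process variables a launcher such as torchrun exports.&lt;br /&gt;
       return {&lt;br /&gt;
           'rank': int(env.get('RANK', 0)),&lt;br /&gt;
           'local_rank': int(env.get('LOCAL_RANK', 0)),&lt;br /&gt;
           'world_size': int(env.get('WORLD_SIZE', 1)),&lt;br /&gt;
       }&lt;br /&gt;
&lt;br /&gt;
   # Simulated environment for the second of two processes on one node&lt;br /&gt;
   print(rank_info({'RANK': '1', 'LOCAL_RANK': '1', 'WORLD_SIZE': '2'}))&lt;br /&gt;
   ```&lt;br /&gt;
   With `torchrun --standalone --nproc_per_node=N ddp_test.py`, `dist.init_process_group(&amp;quot;nccl&amp;quot;)` can be called without explicit `rank` and `world_size` arguments, because the launcher supplies them through the environment.&lt;br /&gt;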
&lt;br /&gt;
== Step 3: Installing DeepSpeed ==&lt;br /&gt;
&lt;br /&gt;
DeepSpeed is a deep learning optimization library that further enhances distributed training efficiency, especially for large models.&lt;br /&gt;
&lt;br /&gt;
1. '''Install DeepSpeed:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;pip install deepspeed&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2. '''Verify DeepSpeed installation:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;python -c &amp;quot;import deepspeed; print(deepspeed.__version__)&amp;quot;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Step 4: DeepSpeed Example with DDP ==&lt;br /&gt;
&lt;br /&gt;
DeepSpeed integrates seamlessly with PyTorch DDP. Here's a modified example.&lt;br /&gt;
&lt;br /&gt;
1. '''Create a Python script (e.g., `deepspeed_test.py`):'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;nano deepspeed_test.py&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2. '''Paste the following code into the file:'''&lt;br /&gt;
   ```python&lt;br /&gt;
   import os&lt;br /&gt;
   import torch&lt;br /&gt;
   import torch.nn as nn&lt;br /&gt;
   import deepspeed&lt;br /&gt;
&lt;br /&gt;
   def main():&lt;br /&gt;
       # The deepspeed launcher starts one process per GPU and sets LOCAL_RANK.&lt;br /&gt;
       local_rank = int(os.environ.get('LOCAL_RANK', 0))&lt;br /&gt;
       torch.cuda.set_device(local_rank)&lt;br /&gt;
       deepspeed.init_distributed()  # sets up the NCCL process group&lt;br /&gt;
&lt;br /&gt;
       # Simple model&lt;br /&gt;
       model = nn.Linear(10, 10)&lt;br /&gt;
&lt;br /&gt;
       # DeepSpeed configuration (minimal example)&lt;br /&gt;
       ds_config = {&lt;br /&gt;
           'train_micro_batch_size_per_gpu': 20,&lt;br /&gt;
           'optimizer': {&lt;br /&gt;
               'type': 'Adam',&lt;br /&gt;
               'params': {&lt;br /&gt;
                   'lr': 0.001&lt;br /&gt;
               }&lt;br /&gt;
           },&lt;br /&gt;
           'fp16': {&lt;br /&gt;
               'enabled': True&lt;br /&gt;
           }&lt;br /&gt;
       }&lt;br /&gt;
&lt;br /&gt;
       # Initialize DeepSpeed; the engine owns the optimizer and fp16 scaling&lt;br /&gt;
       model_engine, optimizer, _, _ = deepspeed.initialize(&lt;br /&gt;
           model=model,&lt;br /&gt;
           model_parameters=model.parameters(),&lt;br /&gt;
           config=ds_config&lt;br /&gt;
       )&lt;br /&gt;
&lt;br /&gt;
       # Dummy data, cast to half precision to match the fp16 engine&lt;br /&gt;
       dummy_input = torch.randn(20, 10).to(model_engine.device).half()&lt;br /&gt;
       labels = torch.randn(20, 10).to(model_engine.device).half()&lt;br /&gt;
&lt;br /&gt;
       # Forward and backward pass&lt;br /&gt;
       outputs = model_engine(dummy_input)&lt;br /&gt;
       loss = nn.MSELoss()(outputs, labels)&lt;br /&gt;
       model_engine.backward(loss)  # handles fp16 loss scaling&lt;br /&gt;
       model_engine.step()          # optimizer step plus gradient zeroing&lt;br /&gt;
&lt;br /&gt;
       rank = torch.distributed.get_rank()&lt;br /&gt;
       print(f'Rank {rank}/{torch.distributed.get_world_size()} loss: {loss.item()}')&lt;br /&gt;
&lt;br /&gt;
   if __name__ == '__main__':&lt;br /&gt;
       main()&lt;br /&gt;
   ```&lt;br /&gt;
&lt;br /&gt;
3. '''Save and exit nano''' (Ctrl+X, Y, Enter).&lt;br /&gt;
&lt;br /&gt;
4. '''Run the script using the DeepSpeed launcher:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;deepspeed deepspeed_test.py&amp;lt;/pre&amp;gt;&lt;br /&gt;
   The launcher starts one process per visible GPU and sets the `LOCAL_RANK` environment variable for each, so the script itself does not need `mp.spawn`.&lt;br /&gt;
&lt;br /&gt;
== Step 5: Real-World Training with DeepSpeed ==&lt;br /&gt;
&lt;br /&gt;
For actual model training, you'll need a more sophisticated DeepSpeed configuration file and a training script.&lt;br /&gt;
&lt;br /&gt;
1. '''Create a DeepSpeed configuration file (e.g., `ds_config.json`):'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;nano ds_config.json&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2. '''Paste a sample configuration. This is a basic example; refer to the DeepSpeed documentation for advanced options.''' Note that the &amp;quot;auto&amp;quot; placeholder values below are resolved only by integrations such as Hugging Face Transformers; when calling `deepspeed.initialize` directly, replace each one with a concrete value:&lt;br /&gt;
   ```json&lt;br /&gt;
   {&lt;br /&gt;
     &amp;quot;fp16&amp;quot;: {&lt;br /&gt;
       &amp;quot;enabled&amp;quot;: true&lt;br /&gt;
     },&lt;br /&gt;
     &amp;quot;optimizer&amp;quot;: {&lt;br /&gt;
       &amp;quot;type&amp;quot;: &amp;quot;AdamW&amp;quot;,&lt;br /&gt;
       &amp;quot;params&amp;quot;: {&lt;br /&gt;
         &amp;quot;lr&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
         &amp;quot;betas&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
         &amp;quot;eps&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
         &amp;quot;weight_decay&amp;quot;: &amp;quot;auto&amp;quot;&lt;br /&gt;
       }&lt;br /&gt;
     },&lt;br /&gt;
     &amp;quot;scheduler&amp;quot;: {&lt;br /&gt;
       &amp;quot;type&amp;quot;: &amp;quot;WarmupLR&amp;quot;,&lt;br /&gt;
       &amp;quot;params&amp;quot;: {&lt;br /&gt;
         &amp;quot;warmup_min_lr&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
         &amp;quot;warmup_max_lr&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
         &amp;quot;warmup_num_steps&amp;quot;: &amp;quot;auto&amp;quot;&lt;br /&gt;
       }&lt;br /&gt;
     },&lt;br /&gt;
     &amp;quot;train_batch_size&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
     &amp;quot;train_micro_batch_size_per_gpu&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
     &amp;quot;gradient_accumulation_steps&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
     &amp;quot;gradient_clipping&amp;quot;: &amp;quot;auto&amp;quot;,&lt;br /&gt;
     &amp;quot;zero_optimization&amp;quot;: {&lt;br /&gt;
       &amp;quot;stage&amp;quot;: 1&lt;br /&gt;
     },&lt;br /&gt;
     &amp;quot;wall_clock_breakdown&amp;quot;: false,&lt;br /&gt;
     &amp;quot;flops_profiler&amp;quot;: {&lt;br /&gt;
       &amp;quot;enabled&amp;quot;: false,&lt;br /&gt;
       &amp;quot;profile_step&amp;quot;: 10&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
   ```&lt;br /&gt;
&lt;br /&gt;
3. '''Modify your training script (e.g., `my_training_script.py`) to load this configuration:'''&lt;br /&gt;
   ```python&lt;br /&gt;
   import deepspeed&lt;br /&gt;
   import torch&lt;br /&gt;
&lt;br /&gt;
   # ... your model definition, dataset loading, etc. ...&lt;br /&gt;
&lt;br /&gt;
   model = YourModel(...)&lt;br /&gt;
   # If ds_config.json defines the optimizer and scheduler (as above),&lt;br /&gt;
   # let DeepSpeed build them from the config instead of constructing your own.&lt;br /&gt;
&lt;br /&gt;
   # Initialize DeepSpeed&lt;br /&gt;
   model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(&lt;br /&gt;
       args=your_arg_parser_object, # If you use argparse&lt;br /&gt;
       model=model,&lt;br /&gt;
       model_parameters=model.parameters(),&lt;br /&gt;
       config='ds_config.json' # Path to your config file&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
   # ... your training loop ...&lt;br /&gt;
   # Use model_engine for forward, backward, and optimizer steps&lt;br /&gt;
   ```&lt;br /&gt;
&lt;br /&gt;
4. '''Launch your training script:'''&lt;br /&gt;
   &amp;lt;pre&amp;gt;deepspeed --num_gpus=N my_training_script.py&amp;lt;/pre&amp;gt;&lt;br /&gt;
   Replace `N` with the number of GPUs you want to use. The config path is already read inside the script, so no extra flag is needed.&lt;br /&gt;
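&lt;br /&gt;
DeepSpeed checks at startup that the batch-size fields in the config are mutually consistent: `train_batch_size` must equal `train_micro_batch_size_per_gpu` multiplied by `gradient_accumulation_steps` and by the number of GPUs. The arithmetic can be sketched as follows (the helper name is illustrative):&lt;br /&gt;
   ```python&lt;br /&gt;
   def effective_train_batch_size(micro_batch_per_gpu, grad_accum_steps, num_gpus):&lt;br /&gt;
       # train_batch_size ==&lt;br /&gt;
       #   train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus&lt;br /&gt;
       return micro_batch_per_gpu * grad_accum_steps * num_gpus&lt;br /&gt;
&lt;br /&gt;
   # Micro batch 4, accumulation 8, 2 GPUs&lt;br /&gt;
   print(effective_train_batch_size(4, 8, 2)) # prints 64&lt;br /&gt;
   ```&lt;br /&gt;
   If the values are inconsistent, DeepSpeed aborts with a batch-size mismatch error, so it is worth doing this multiplication before launching a long job.&lt;br /&gt;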
&lt;br /&gt;
== Troubleshooting ==&lt;br /&gt;
&lt;br /&gt;
* '''`torch.distributed.DistBackendError: NCCL error`''':&lt;br /&gt;
    * '''Cause:''' Often due to network issues between nodes (if distributed across multiple machines), incorrect CUDA/NCCL versions, or insufficient GPU memory.&lt;br /&gt;
    * '''Solution:'''&lt;br /&gt;
        * Ensure all nodes can communicate with each other on the specified ports.&lt;br /&gt;
        * Verify that your PyTorch and CUDA versions are compatible with your NVIDIA drivers and NCCL.&lt;br /&gt;
        * Reduce batch size or enable DeepSpeed's memory optimization features (like ZeRO).&lt;br /&gt;
* '''`RuntimeError: CUDA error: out of memory`''':&lt;br /&gt;
    * '''Cause:''' Your model or batch size is too large for the GPU memory.&lt;br /&gt;
    * '''Solution:'''&lt;br /&gt;
        * Reduce the `train_batch_size` or `train_micro_batch_size_per_gpu` in your DeepSpeed config.&lt;br /&gt;
        * Use `gradient_accumulation_steps` to simulate larger batch sizes.&lt;br /&gt;
        * Enable `fp16` training in your DeepSpeed config.&lt;br /&gt;
        * Consider using DeepSpeed's ZeRO optimization stages (stage 2 or 3) for more aggressive memory savings.&lt;br /&gt;
* '''Process hangs or deadlocks:'''&lt;br /&gt;
    * '''Cause:''' Incorrect initialization of the process group, or issues with `torch.multiprocessing`.&lt;br /&gt;
    * '''Solution:'''&lt;br /&gt;
        * Double-check that `MASTER_ADDR` and `MASTER_PORT` are correctly set and accessible by all processes.&lt;br /&gt;
        * Ensure that `dist.init_process_group` is called by all processes.&lt;br /&gt;
        * For multi-node setups, ensure `RANK` and `WORLD_SIZE` environment variables are correctly set on each node.&lt;br /&gt;
* '''Performance is lower than expected:'''&lt;br /&gt;
    * '''Cause:''' Bottlenecks in data loading, inefficient model architecture, or suboptimal DeepSpeed configuration.&lt;br /&gt;
    * '''Solution:'''&lt;br /&gt;
        * Profile your data loading pipeline.&lt;br /&gt;
        * Experiment with different DeepSpeed configuration parameters (e.g., `zero_optimization`, `offload`).&lt;br /&gt;
        * Ensure your `train_micro_batch_size_per_gpu` is large enough to keep GPUs utilized.&lt;br /&gt;
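&lt;br /&gt;
For the hang and NCCL cases above, a quick check is whether `MASTER_PORT` is actually free before launching. A small sketch using only the standard library (`port_is_free` is a hypothetical helper, not part of PyTorch):&lt;br /&gt;
```python&lt;br /&gt;
import socket&lt;br /&gt;
&lt;br /&gt;
def port_is_free(host, port):&lt;br /&gt;
    # True if nothing is bound to the port, i.e. it can serve as MASTER_PORT.&lt;br /&gt;
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:&lt;br /&gt;
        try:&lt;br /&gt;
            s.bind((host, port))&lt;br /&gt;
            return True&lt;br /&gt;
        except OSError:&lt;br /&gt;
            return False&lt;br /&gt;
&lt;br /&gt;
print(port_is_free('127.0.0.1', 12355))&lt;br /&gt;
```&lt;br /&gt;
If the port is taken, pick another free port and export the same value on every node before starting the job.&lt;br /&gt;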
&lt;br /&gt;
== Further Reading ==&lt;br /&gt;
* [[PyTorch Distributed Overview]]&lt;br /&gt;
* [[NVIDIA CUDA Installation Guide]]&lt;br /&gt;
* [https://www.deepspeed.ai/docs/ DeepSpeed Documentation]&lt;br /&gt;
* [https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html PyTorch DistributedDataParallel Documentation]&lt;br /&gt;
&lt;br /&gt;
[[Category:AI and GPU]]&lt;br /&gt;
[[Category:Distributed Computing]]&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
&lt;br /&gt;
{{Exchange Box}}&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
</feed>