How to Set Up a GPU Server for AI Training
This hands-on guide walks you through configuring a GPU server for deep learning with CUDA, PyTorch, and TensorFlow. For hardware selection guidance, see GPU Servers for Machine Learning and AI.
Prerequisites
- A server with an NVIDIA GPU (H100, A100, RTX 4090, or similar)
- Ubuntu 22.04 or 24.04 LTS (recommended for best driver support)
- Root or sudo access

Related guides: GPU Servers for Machine Learning and AI, Choosing the Right Dedicated Server, Linux Server Administration Guide.
Immers Cloud offers GPU servers with pre-installed NVIDIA drivers and CUDA, which can save significant setup time.
Step 1: Install NVIDIA Drivers
Check your GPU:
lspci | grep -i nvidia
Install the latest drivers:
sudo apt update
sudo apt install -y nvidia-driver-550
sudo reboot
Verify after reboot:
nvidia-smi
You should see your GPU model, driver version, and CUDA version.
Step 2: Install CUDA Toolkit
Download and install CUDA 12.x:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-4
Add to your PATH:
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Verify:
nvcc --version
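If you want to confirm programmatically that the toolkit on your PATH matches what you installed, you can parse the `nvcc --version` banner. A minimal sketch, assuming the standard banner format (the sample text below is illustrative):

```python
import re
import subprocess


def cuda_version_from_banner(banner: str) -> str:
    """Extract the release number (e.g. '12.4') from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", banner)
    if match is None:
        raise ValueError("no CUDA release string found in banner")
    return match.group(1)


def installed_cuda_version() -> str:
    """Run nvcc and return the toolkit version it reports."""
    banner = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    return cuda_version_from_banner(banner)


# Illustrative sample of the banner line nvcc prints:
sample = "Cuda compilation tools, release 12.4, V12.4.131"
print(cuda_version_from_banner(sample))  # 12.4
```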
Step 3: Install cuDNN
cuDNN accelerates neural network operations:
sudo apt install -y libcudnn8 libcudnn8-dev
Step 4: Set Up Python Environment
Use Miniconda for isolated environments:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
~/miniconda3/bin/conda init bash
source ~/.bashrc
Create a dedicated environment:
conda create -n ml python=3.11 -y
conda activate ml
Step 5: Install PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Verify GPU access:
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
Step 6: Install TensorFlow
pip install tensorflow[and-cuda]
Verify:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Step 7: Optimize for Training
Enable Mixed Precision
Mixed precision (FP16/BF16) can roughly double training throughput on Tensor Core GPUs with minimal accuracy loss:
PyTorch:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

optimizer.zero_grad()
with autocast():
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()   # scale the loss to avoid FP16 underflow
scaler.step(optimizer)          # unscale gradients, then step
scaler.update()
TensorFlow:
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')
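The memory saving comes from the element width: FP32 stores 4 bytes per value, while FP16 and BF16 store 2. A quick back-of-the-envelope helper (the layer sizes below are illustrative):

```python
# Bytes per element for common training dtypes.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def tensor_mib(num_elements: int, dtype: str) -> float:
    """Memory footprint of a tensor in MiB for a given element type."""
    return num_elements * BYTES_PER_ELEMENT[dtype] / 2**20


# Activations of a 4096-wide layer output at batch size 32, seq length 4096:
n = 32 * 4096 * 4096
print(tensor_mib(n, "fp32"))  # 2048.0
print(tensor_mib(n, "fp16"))  # 1024.0
```

Halving activation memory is also why mixed precision often lets you raise the batch size instead of just running faster.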
Monitor GPU Usage
watch -n 1 nvidia-smi
Or install nvitop for a better interface:
pip install nvitop
nvitop
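If you would rather log utilization from a script than watch it interactively, `nvidia-smi` can emit machine-readable CSV via its `--query-gpu` flags. A sketch that parses that output (the sample line below is illustrative; field order follows the query string):

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output,
    one row per GPU, fields separated by ', '."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(v) for v in line.split(", "))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def live_gpu_stats() -> list[dict]:
    """Query the local GPUs (requires nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)


# Illustrative sample for a two-GPU server:
sample = "87, 61322, 81920\n12, 1024, 81920\n"
print(parse_gpu_stats(sample))
```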
Step 8: Multi-GPU Training
For servers with multiple GPUs:
PyTorch Distributed Data Parallel:
torchrun --nproc_per_node=4 train.py
TensorFlow MirroredStrategy:
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = create_model()
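Under the hood, data-parallel training shards each epoch's samples across ranks so every GPU sees a disjoint (padded) slice of the dataset. PyTorch's `DistributedSampler` does roughly the following strided split — a simplified sketch for intuition, not the library implementation:

```python
import math


def shard_indices(num_samples: int, world_size: int, rank: int) -> list[int]:
    """Assign dataset indices to one rank, padding with wrap-around
    so every rank receives the same number of samples."""
    total = math.ceil(num_samples / world_size) * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]  # wrap-around padding
    return indices[rank:total:world_size]      # this rank's strided slice


# 10 samples over 4 GPUs: each rank gets 3 indices, two are repeats.
for r in range(4):
    print(r, shard_indices(10, 4, r))
```

This is why the effective batch size in multi-GPU training is the per-GPU batch size multiplied by the number of GPUs.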
Common Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce batch size, enable gradient checkpointing |
| Driver/CUDA version mismatch | Check the compatibility matrix on the NVIDIA website |
| Slow training speed | Enable mixed precision, check for data-loading bottlenecks |
| GPU not detected | Verify the driver with nvidia-smi, check PCIe seating |
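When fixing "CUDA out of memory" by shrinking the batch, you can pair the smaller micro-batch with gradient accumulation so the effective batch size, and thus the optimization behavior, stays the same. The arithmetic is simple (a sketch; the function name is illustrative):

```python
def accumulation_steps(target_batch: int, micro_batch: int) -> int:
    """Number of micro-batches to accumulate before each optimizer step,
    so that micro_batch * steps == target_batch."""
    if target_batch % micro_batch != 0:
        raise ValueError("micro_batch must divide target_batch evenly")
    return target_batch // micro_batch


# Batch 64 no longer fits in memory, but 16 does: accumulate 4 micro-batches.
print(accumulation_steps(64, 16))  # 4
```

In the training loop this means calling `backward()` on each micro-batch but only stepping the optimizer (and zeroing gradients) every `accumulation_steps` iterations.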