How to Set Up a GPU Server for AI Training

From Server rental store
Revision as of 16:04, 12 April 2026 by Admin (talk | contribs) (New guide article)

This hands-on guide walks through configuring a GPU server for deep learning with CUDA, PyTorch, and TensorFlow. For hardware selection guidance, see GPU Servers for Machine Learning and AI.

Prerequisites

  • A GPU server with NVIDIA GPU (H100, A100, RTX 4090, or similar)
  • Ubuntu 22.04 or 24.04 LTS (recommended for best driver support)
  • Root or sudo access

Immers Cloud offers GPU servers with pre-installed NVIDIA drivers and CUDA, which can save significant setup time.

Step 1: Install NVIDIA Drivers

Check your GPU:

lspci | grep -i nvidia

Install the latest drivers:

sudo apt update
sudo apt install -y nvidia-driver-550
sudo reboot

Verify after reboot:

nvidia-smi

You should see your GPU model, driver version, and CUDA version.

Step 2: Install CUDA Toolkit

Download and install CUDA 12.x. The commands below use NVIDIA's Ubuntu 22.04 repository; on Ubuntu 24.04, substitute ubuntu2404 for ubuntu2204 in the URL:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-4

Add to your PATH:

echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

Verify:

nvcc --version
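If nvcc is still not found after sourcing ~/.bashrc, it usually means the exports above did not land in the current shell. A minimal sketch for checking the two environment variables from Python (the /usr/local/cuda paths match the exports above; cuda_paths_configured is a helper name used here for illustration):

```python
import os

def cuda_paths_configured(env=os.environ):
    """Check that the CUDA bin and lib64 directories from the
    exports above appear in PATH and LD_LIBRARY_PATH."""
    on_path = "/usr/local/cuda/bin" in env.get("PATH", "").split(os.pathsep)
    on_ld = "/usr/local/cuda/lib64" in env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    return on_path, on_ld

# example with a hand-built environment; on the server, call it with no
# arguments to inspect the real one
print(cuda_paths_configured({"PATH": "/usr/local/cuda/bin:/usr/bin",
                             "LD_LIBRARY_PATH": "/usr/local/cuda/lib64"}))  # → (True, True)
```

If either value comes back False in a fresh login shell, re-check the two echo lines in ~/.bashrc.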

Step 3: Install cuDNN

cuDNN provides highly tuned GPU primitives (convolutions, attention, normalization) that PyTorch and TensorFlow use under the hood:

sudo apt install -y libcudnn8 libcudnn8-dev

Step 4: Set Up Python Environment

Use Miniconda for isolated environments:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
~/miniconda3/bin/conda init bash
source ~/.bashrc

Create a dedicated environment:

conda create -n ml python=3.11 -y
conda activate ml

Step 5: Install PyTorch

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Verify GPU access:

python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

Step 6: Install TensorFlow

pip install tensorflow[and-cuda]

Verify:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Step 7: Optimize for Training

Enable Mixed Precision

Mixed precision (FP16/BF16) can roughly double training throughput on GPUs with Tensor Cores, usually with negligible accuracy loss:

PyTorch:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()  # scale the loss so FP16 gradients don't underflow
scaler.step(optimizer)         # unscales gradients, then runs the optimizer step
scaler.update()                # adjusts the scale factor for the next iteration
optimizer.zero_grad()

TensorFlow:

from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')
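Mixed precision also cuts memory. A back-of-the-envelope sketch of the weight storage alone (the 7B parameter count is an arbitrary example; real training adds optimizer state and activations on top):

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed just to store the weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

n = 7_000_000_000                # example: a 7B-parameter model
fp32 = weight_memory_gb(n, 4)    # float32: 4 bytes per parameter
fp16 = weight_memory_gb(n, 2)    # float16/bfloat16: 2 bytes per parameter
print(f"FP32: {fp32:.1f} GiB, FP16: {fp16:.1f} GiB")  # → FP32: 26.1 GiB, FP16: 13.0 GiB
```

In practice the savings are smaller than 2x, since optimizers like Adam keep FP32 master weights plus two moment tensors per parameter.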

Monitor GPU Usage

watch -n 1 nvidia-smi

Or install nvitop for a better interface:

pip install nvitop
nvitop

Step 8: Multi-GPU Training

For servers with multiple GPUs:

PyTorch Distributed Data Parallel:

torchrun --nproc_per_node=4 train.py

TensorFlow MirroredStrategy:

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # model creation and compilation must happen inside the scope
    model = create_model()
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
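When moving from one GPU to several, the effective batch size grows with the number of processes, and a common convention (the linear scaling rule) scales the learning rate by the same factor. A small sketch of the arithmetic; the base values are arbitrary examples:

```python
def scaled_hyperparams(per_gpu_batch, base_lr, n_gpus):
    """Effective batch size grows linearly with GPU count; the linear
    scaling rule multiplies the learning rate by the same factor."""
    effective_batch = per_gpu_batch * n_gpus
    scaled_lr = base_lr * n_gpus
    return effective_batch, scaled_lr

# 4 processes, one per GPU, as in torchrun --nproc_per_node=4
print(scaled_hyperparams(per_gpu_batch=32, base_lr=1e-4, n_gpus=4))
```

A learning-rate warmup over the first few epochs is usually paired with this rule to keep early training stable.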

Common Troubleshooting

  • CUDA out of memory: reduce the batch size or enable gradient checkpointing.
  • Driver/CUDA version mismatch: check the compatibility matrix on the NVIDIA website.
  • Slow training speed: enable mixed precision; check for a data-loading bottleneck.
  • GPU not detected: verify the driver with nvidia-smi; check PCIe seating.
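The "reduce batch size" fix for out-of-memory errors can be automated: retry the failing step with a halved batch until it fits. A framework-agnostic sketch, where fit_batch_size and fake_step are hypothetical names and MemoryError stands in for the framework's OOM exception (in PyTorch, torch.cuda.OutOfMemoryError):

```python
def fit_batch_size(run_training_step, batch_size, min_batch=1):
    """Halve the batch size until one training step succeeds."""
    while batch_size >= min_batch:
        try:
            run_training_step(batch_size)
            return batch_size  # this batch size fits in GPU memory
        except MemoryError:
            batch_size //= 2   # OOM: halve and retry
    raise RuntimeError("even the minimum batch size does not fit")

# simulated step: pretend anything above 16 samples triggers an OOM
def fake_step(bs):
    if bs > 16:
        raise MemoryError(f"simulated CUDA OOM at batch size {bs}")

print(fit_batch_size(fake_step, 128))  # → 16
```

With a real framework, remember to free cached memory between retries (e.g. torch.cuda.empty_cache() in PyTorch), or the retry may fail for stale allocations rather than the batch size.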

See Also