# AI in Durham: Server Configuration

This document describes the server configuration behind the "AI in Durham" project, which provides computational resources for local Artificial Intelligence research and development. It is intended for new system administrators and developers contributing to the project's infrastructure. The project takes a hybrid cloud approach, combining on-premise hardware with cloud services.

## Overview

The "AI in Durham" project needs a server infrastructure that scales to demanding machine learning, deep learning, and data science workloads. The design goals are high throughput, low latency, and data security. GPU servers handle training, CPU servers handle inference and general processing, and a distributed file system stores the data. Cloud resources provide burst capacity and specialized services. See Server Infrastructure Overview for a more general description of our server environments.
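The division of labor described above can be sketched as a small routing rule. This is only an illustration: the workload names, the `Pool` enum, and the `choose_pool` function are hypothetical and do not describe an actual scheduler used by the project.

```python
from enum import Enum


class Pool(Enum):
    GPU = "on-premise GPU servers"
    CPU = "on-premise CPU servers"
    CLOUD = "cloud burst capacity"


# Hypothetical mapping that follows the overview: training goes to GPU
# servers, inference and pre-processing go to CPU servers.
PREFERRED_POOL = {
    "training": Pool.GPU,
    "inference": Pool.CPU,
    "preprocessing": Pool.CPU,
}


def choose_pool(workload: str, on_prem_has_capacity: bool) -> Pool:
    """Pick a pool for a workload, bursting to cloud when on-premise is full."""
    # Anything not listed is treated as general processing on CPU servers.
    preferred = PREFERRED_POOL.get(workload, Pool.CPU)
    return preferred if on_prem_has_capacity else Pool.CLOUD


print(choose_pool("training", on_prem_has_capacity=True).value)
print(choose_pool("training", on_prem_has_capacity=False).value)
```

In practice the capacity check would come from whatever scheduler or monitoring system the deployment uses; a boolean stands in for it here.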

## Hardware Specifications

The core on-premise infrastructure consists of three primary server types: GPU servers, CPU servers, and storage servers.

### GPU Servers

These servers are dedicated to computationally intensive tasks like model training.

| Specification | Value |
|---|---|
| Model | Dell PowerEdge R750xa |
| CPU | 2 x Intel Xeon Gold 6348 (28 cores per CPU) |
| GPU | 4 x NVIDIA A100 (80 GB HBM2e) |
| RAM | 512 GB DDR4 ECC REG |
| Storage | 4 x 4 TB NVMe PCIe Gen4 SSD (RAID 0) |
| Network | 2 x 100GbE ConnectX-6 |
| Operating System | Ubuntu 22.04 LTS |

These servers run CUDA Toolkit 12.2 and cuDNN 8.9.2 to support deep learning frameworks such as TensorFlow and PyTorch. See GPU Server Maintenance for information on monitoring and upkeep.
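When provisioning or auditing a GPU server, it can help to confirm that the installed toolkit matches the version named here. The following is a minimal sketch that parses the release line printed by `nvcc --version`. The helper names are hypothetical, and the sample string (including its build suffix) is illustrative rather than captured from a project host.

```python
import re
from typing import Optional

EXPECTED_CUDA = "12.2"  # CUDA Toolkit version stated in this document


def parse_cuda_release(nvcc_output: str) -> Optional[str]:
    """Return the 'major.minor' release found in `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_matches(nvcc_output: str, expected: str = EXPECTED_CUDA) -> bool:
    """True when the parsed release equals the expected version."""
    return parse_cuda_release(nvcc_output) == expected


# Illustrative sample of the release line; the build suffix is an assumption.
SAMPLE = "Cuda compilation tools, release 12.2, V12.2.140"
print(cuda_matches(SAMPLE))
```

On a real host, the input could come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`, assuming `nvcc` is on the `PATH`.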

### CPU Servers

These servers handle inference, data pre-processing, and general-purpose computing tasks.

| Specification | Value |
|---|---|
| Model | Supermicro SuperServer 2029U-TR4 |
| CPU | 2 x AMD EPYC 7763 (64 cores per CPU) |
| RAM | 1 TB DDR4 ECC REG |
| Storage | 8 x 8 TB SATA HDD (RAID 6) + 2 x 1 TB NVMe SSD (OS) |
| Network | 2 x 25GbE |
| Operating System | CentOS Stream 9 |
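The RAID levels in the Storage rows determine how much of the raw disk space is usable. As a rough sketch, assuming nominal capacities and ignoring filesystem overhead and hot spares, the arithmetic for the GPU server scratch array (RAID 0) and the CPU server data array (RAID 6) looks like this. The `usable_tb` helper is hypothetical.

```python
def usable_tb(disks: int, disk_tb: float, raid_level: int) -> float:
    """Nominal usable capacity in TB for the RAID levels used in this document."""
    if raid_level == 0:
        parity_disks = 0  # striping only, no redundancy
    elif raid_level == 6:
        if disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        parity_disks = 2  # two disks' worth of capacity hold parity
    else:
        raise ValueError(f"RAID level {raid_level} is not covered by this sketch")
    return (disks - parity_disks) * disk_tb


print(f"GPU server scratch (4 x 4 TB, RAID 0): {usable_tb(4, 4, 0)} TB")
print(f"CPU server data (8 x 8 TB, RAID 6): {usable_tb(8, 8, 6)} TB")
```

RAID 0 keeps the full 16 TB but loses the whole array if any single drive fails, while RAID 6 gives up two drives' worth of capacity (48 TB usable from 64 TB raw) in exchange for tolerating two drive failures.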

These servers use Docker containers for application isolation and portability. For details on containerization practices, see Containerization Best Practices.
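As an illustration of container isolation on a shared CPU server, the sketch below assembles a `docker run` command with explicit CPU and memory limits. The image name, container name, and limit values are hypothetical placeholders; the command is only built and printed, not executed.

```python
import shlex
from typing import List


def docker_run_args(image: str, name: str, cpus: int, memory: str) -> List[str]:
    """Build a `docker run` argument list with CPU and memory limits."""
    return [
        "docker", "run",
        "--rm",               # remove the container when it exits
        "--name", name,
        "--cpus", str(cpus),  # cap the CPU share available to the container
        "--memory", memory,   # cap memory, e.g. "32g"
        image,
    ]


# Hypothetical image and limits, chosen only for illustration.
args = docker_run_args("example/inference:latest", "inference-demo", 8, "32g")
print(shlex.join(args))
```

Passing the list to `subprocess.run(args, check=True)` would launch the container on a host where Docker is installed. Building the command as a list avoids shell quoting problems.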

### Storage Servers

These servers provide centralized storage for datasets and model artifacts.

| Specification | Value |
|---|---|
| Model | NetApp FAS2750 |
| Storage Capacity | 368 TB raw (usable capacity varies with RAID configuration) |
| RAID Level | RAID-6 |
| Network | 4 x 40GbE |
| File System | ONTAP |

Storage is accessed via NFS and SMB protocols. Refer to the Data Storage Policy for details on data backup and recovery procedures.
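For Linux clients, NFS access is commonly configured through an `/etc/fstab` entry. The sketch below formats such an entry. The server hostname, export path, mount point, and mount options are all assumptions made for illustration; the actual values depend on how the storage servers are exported.

```python
from typing import Sequence


def nfs_fstab_entry(
    server: str,
    export: str,
    mount_point: str,
    options: Sequence[str] = ("rw", "hard", "vers=4.1", "_netdev"),
) -> str:
    """Format one /etc/fstab line for an NFS mount."""
    # Fields: device, mount point, filesystem type, options, dump, fsck pass.
    return f"{server}:{export} {mount_point} nfs {','.join(options)} 0 0"


# Hypothetical hostname and paths.
entry = nfs_fstab_entry("storage.example.internal", "/datasets", "/mnt/datasets")
print(entry)
```

The `_netdev` option tells the system to wait for networking before mounting, and `hard` makes clients retry rather than return I/O errors during a storage outage.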

## Software Stack

The software stack is designed for flexibility and scalability. Key components include:
