# AI in Cardiff: Server Configuration

This document describes the server configuration behind the "AI in Cardiff" project. It is intended for new system administrators and developers working with this infrastructure. The project uses a cluster of servers to deliver machine learning services and data analysis capabilities, and knowing how those servers are specified is essential for maintenance, troubleshooting, and future scaling. Please review the System Administration Guide before making any changes.

## Overview

The "AI in Cardiff" project relies on a hybrid server infrastructure, utilizing both physical servers and virtual machines. The physical servers handle computationally intensive tasks, while virtual machines provide flexibility for development, testing, and less demanding workloads. All servers are located within the secure data center at Cardiff University. See the Data Center Access Policy for details. The network topology is documented in the Network Diagram.

## Physical Server Specifications

The core of the AI processing power comes from three dedicated physical servers, named 'Alys', 'Rhys', and 'Idris'. These servers are optimized for GPU-accelerated computing.

| Server Name | CPU | RAM | GPU | Storage |
|---|---|---|---|---|
| Alys | 2 x Intel Xeon Gold 6248R (24 cores/48 threads per CPU) | 512 GB DDR4 ECC REG | 4 x NVIDIA A100 (80 GB) | 8 x 4 TB NVMe SSD (RAID 0) |
| Rhys | 2 x Intel Xeon Gold 6248R (24 cores/48 threads per CPU) | 512 GB DDR4 ECC REG | 4 x NVIDIA A100 (80 GB) | 8 x 4 TB NVMe SSD (RAID 0) |
| Idris | 2 x AMD EPYC 7763 (64 cores/128 threads per CPU) | 1 TB DDR4 ECC REG | 8 x NVIDIA A100 (80 GB) | 8 x 8 TB NVMe SSD (RAID 0) |
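The per-server figures above add up to the cluster totals that matter for capacity planning. The following is a minimal sketch of that arithmetic, not part of the project's tooling: the `PhysicalServer` class and `cluster_totals` function are hypothetical names introduced for illustration, and 1 TB of RAM is assumed to mean 1024 GB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalServer:
    name: str
    cpu_sockets: int
    cores_per_cpu: int
    ram_gb: int
    gpus: int
    gpu_mem_gb: int
    drives: int
    drive_tb: int

    @property
    def cores(self) -> int:
        return self.cpu_sockets * self.cores_per_cpu

    @property
    def gpu_mem_total_gb(self) -> int:
        return self.gpus * self.gpu_mem_gb

    @property
    def raid0_capacity_tb(self) -> int:
        # RAID 0 stripes across all drives: capacity is the plain sum,
        # and a single drive failure loses the whole array.
        return self.drives * self.drive_tb


# Values copied from the specification table; 1 TB of RAM is taken as 1024 GB.
SERVERS = [
    PhysicalServer("Alys", 2, 24, 512, 4, 80, 8, 4),
    PhysicalServer("Rhys", 2, 24, 512, 4, 80, 8, 4),
    PhysicalServer("Idris", 2, 64, 1024, 8, 80, 8, 8),
]


def cluster_totals(servers):
    """Sum physical cores, RAM, GPUs, GPU memory and RAID 0 capacity."""
    return {
        "cores": sum(s.cores for s in servers),
        "ram_gb": sum(s.ram_gb for s in servers),
        "gpus": sum(s.gpus for s in servers),
        "gpu_mem_gb": sum(s.gpu_mem_total_gb for s in servers),
        "storage_tb": sum(s.raid0_capacity_tb for s in servers),
    }


if __name__ == "__main__":
    for key, value in cluster_totals(SERVERS).items():
        print(f"{key}: {value}")
```

Because every array is RAID 0, the storage figure is raw capacity with no redundancy, so anything kept on these arrays should be reproducible or backed up elsewhere.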

These servers run a custom-built Linux distribution based on Ubuntu Server 22.04 and are connected via a 100 Gbps InfiniBand network. Refer to the InfiniBand Configuration Guide for network details. Monitoring is handled by Prometheus and Grafana.
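Since monitoring runs on Prometheus, a quick health check can be scripted against its HTTP API. The sketch below queries the standard `up` metric through the `/api/v1/query` endpoint and lists targets that report as down. The Prometheus address is a placeholder, since the real endpoint is not given here, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical address; the real Prometheus endpoint is not given in this document.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def down_instances(payload: dict) -> list[str]:
    """List instances whose `up` sample is 0 in an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for sample in payload["data"]["result"]:
        _timestamp, value = sample["value"]  # the sample value arrives as a string
        if float(value) == 0.0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return sorted(down)


def check(base_url: str = PROMETHEUS_URL) -> list[str]:
    """Fetch the current `up` samples and return the instances that are down."""
    url = build_query_url(base_url, "up")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return down_instances(json.load(resp))
```

Calling `check()` from a host that can reach Prometheus returns an empty list when every scrape target is healthy. Which targets exist depends on the scrape configuration, which is not covered here.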

## Virtual Machine Configuration

A cluster of virtual machines (VMs) is managed using Proxmox VE. These VMs are used for a variety of tasks, including development, testing, and data pre-processing.

| VM Name | CPU | RAM | Storage | Operating System |
|---|---|---|---|---|
| dev-1 | 8 vCPUs | 32 GB | 500 GB SSD | Ubuntu 22.04 |
| test-1 | 4 vCPUs | 16 GB | 250 GB SSD | Ubuntu 22.04 |
| data-prep-1 | 16 vCPUs | 64 GB | 1 TB SSD | CentOS 7 |
| model-serving-1 | 8 vCPUs | 32 GB | 500 GB SSD | Ubuntu 22.04 |
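When sizing the Proxmox hosts or planning a new VM, it helps to know how much is already allocated. The sketch below sums the allocations listed above and groups the VMs by operating system. The `VM` tuple and both functions are hypothetical names used only for illustration, and 1 TB of disk is assumed to mean 1000 GB.

```python
from typing import NamedTuple


class VM(NamedTuple):
    name: str
    vcpus: int
    ram_gb: int
    disk_gb: int
    os: str


# Values copied from the VM table; 1 TB of disk is taken as 1000 GB.
VMS = [
    VM("dev-1", 8, 32, 500, "Ubuntu 22.04"),
    VM("test-1", 4, 16, 250, "Ubuntu 22.04"),
    VM("data-prep-1", 16, 64, 1000, "CentOS 7"),
    VM("model-serving-1", 8, 32, 500, "Ubuntu 22.04"),
]


def allocation_totals(vms):
    """Sum the vCPU, RAM and disk allocations across all VMs."""
    return {
        "vcpus": sum(vm.vcpus for vm in vms),
        "ram_gb": sum(vm.ram_gb for vm in vms),
        "disk_gb": sum(vm.disk_gb for vm in vms),
    }


def group_by_os(vms):
    """Map each operating system to the names of the VMs that run it."""
    groups = {}
    for vm in vms:
        groups.setdefault(vm.os, []).append(vm.name)
    return groups


if __name__ == "__main__":
    print(allocation_totals(VMS))
    print(group_by_os(VMS))
```

Grouping by operating system makes it easy to see which VMs share a patching schedule, for example that `data-prep-1` is the only CentOS 7 guest.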

Each VM has access to the same InfiniBand network as the physical servers, allowing for high-speed data transfer. VMs are backed up daily, as described in the Backup and Recovery Plan.

## Software Stack

The software stack deployed on these servers is critical to the project's success. Key components include:
