AI in Gillingham: Server Configuration Documentation

Welcome to the documentation for the "AI in Gillingham" server deployment. This article provides a comprehensive overview of the hardware and software configuration of this dedicated AI processing cluster. It is intended for new administrators and engineers onboarding to the project. Please read it carefully before making any changes to the system, and refer to our System Administration Guidelines for general operational procedures.

Overview

The "AI in Gillingham" project utilizes a distributed server cluster to perform computationally intensive machine learning tasks, specifically focused on image recognition and natural language processing. The cluster is located within the Gillingham data center and is designed for scalability and high availability. It’s important to consult the Data Center Access Procedures before any physical access is required. The entire system is monitored via Nagios Monitoring System. Our Disaster Recovery Plan details procedures in case of failure.

Hardware Configuration

The cluster consists of a master node and four worker nodes. All nodes are interconnected via a 100Gbps InfiniBand network. Detailed specifications for each node type are provided below.
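As a rough illustration of what the 100Gbps interconnect means in practice, the sketch below estimates transfer time for a dataset of a given size. The 60 GB figure is a hypothetical example, not a measured workload, and the estimate ignores protocol overhead.

```python
def transfer_time_seconds(size_gb: float, link_gbps: float = 100.0) -> float:
    """Estimate time to move size_gb gigabytes over a link_gbps link.

    Ignores protocol overhead and congestion, so this is a best-case
    lower bound, not a benchmark.
    """
    size_gbit = size_gb * 8  # gigabytes -> gigabits
    return size_gbit / link_gbps

# A hypothetical 60 GB training shard over the 100Gbps InfiniBand fabric:
print(transfer_time_seconds(60))  # ~4.8 seconds, best case
```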

Master Node

The master node manages task scheduling, data distribution, and overall cluster health. It does *not* participate in the actual AI processing.

Specification        Value
CPU                  Dual Intel Xeon Gold 6338
RAM                  256 GB DDR4 ECC Registered
Storage (OS)         1 TB NVMe SSD
Storage (Metadata)   8 TB SAS HDD in RAID 1
Network Interface    Dual 100Gbps InfiniBand, Dual 10Gbps Ethernet
Power Supply         Redundant 1600W Platinum

Worker Nodes

The worker nodes perform the actual AI processing. They are equipped with powerful GPUs to accelerate computations.

Specification        Value
CPU                  Dual Intel Xeon Silver 4310
RAM                  128 GB DDR4 ECC Registered
Storage (OS)         512 GB NVMe SSD
GPU                  4x NVIDIA A100 80GB
Network Interface    Dual 100Gbps InfiniBand
Power Supply         Redundant 1600W Platinum
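Taken together, the worker-node specifications imply the aggregate compute resources tallied below. This is a back-of-the-envelope sketch assuming all four worker nodes are healthy.

```python
# Cluster totals derived from the worker-node specification table.
WORKER_NODES = 4
GPUS_PER_NODE = 4          # NVIDIA A100 per worker
GPU_MEM_GB = 80            # memory per A100
RAM_PER_NODE_GB = 128      # DDR4 ECC Registered per worker

total_gpus = WORKER_NODES * GPUS_PER_NODE
total_gpu_mem_gb = total_gpus * GPU_MEM_GB
total_ram_gb = WORKER_NODES * RAM_PER_NODE_GB

print(f"{total_gpus} GPUs, {total_gpu_mem_gb} GB GPU memory, {total_ram_gb} GB RAM")
# 16 GPUs, 1280 GB GPU memory, 512 GB RAM
```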

Network Infrastructure

The network is a critical component of the cluster. All nodes reside on a dedicated VLAN.

Component          Specification
Switch             Mellanox Spectrum-2
Network Topology   Fat Tree
VLAN               192.168.10.0/24
Subnet Mask        255.255.255.0
Gateway            192.168.10.1
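The addressing above can be sanity-checked with Python's standard ipaddress module. The sketch below assumes the /24 prefix implied by the 255.255.255.0 subnet mask.

```python
import ipaddress

# Cluster VLAN as documented: 192.168.10.0 with mask 255.255.255.0 (/24).
vlan = ipaddress.ip_network("192.168.10.0/255.255.255.0")
gateway = ipaddress.ip_address("192.168.10.1")

print(vlan)                    # 192.168.10.0/24
print(gateway in vlan)         # True: the gateway lies inside the VLAN
print(vlan.num_addresses - 2)  # 254 usable host addresses
```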

Software Configuration

The cluster runs Ubuntu Server 22.04 LTS with a customized kernel for optimized GPU performance. The primary AI framework used is TensorFlow 2.12. Detailed instructions for installing and configuring TensorFlow can be found in the TensorFlow Installation Guide. We leverage Kubernetes for container orchestration and Docker for containerization.
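A quick way to confirm that a worker node's GPUs are visible to the framework is a check like the one below. This is a minimal sketch that falls back gracefully on machines where TensorFlow is not importable; on a healthy worker node it should report the four A100s.

```python
import importlib.util

def gpu_report() -> str:
    """Report how many GPUs TensorFlow can see on this node.

    On a healthy worker node this should report 4 GPUs (the A100s);
    on a machine without TensorFlow it returns a fallback message.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return "TensorFlow is not installed on this node"
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    return f"TensorFlow {tf.__version__} sees {len(gpus)} GPU(s)"

print(gpu_report())
```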

Operating System
