AI in Manchester: Server Configuration

Welcome to the documentation for the "AI in Manchester" server cluster. This guide describes the hardware, software stack, and networking that power our Artificial Intelligence initiatives in the Manchester region. It is intended for new system administrators and developers joining the project. Please review it carefully before making any changes to the system.

Overview

The "AI in Manchester" project uses a distributed server cluster to handle the computational demands of machine learning model training and inference. The cluster is housed in a secure data centre in central Manchester and comprises high-performance compute nodes, storage servers, and network infrastructure, which lets us process large datasets and deploy AI models at scale. We take a hybrid cloud approach: on-premise resources handle sensitive data, and cloud bursting absorbs peak demand. This setup is detailed in Data Security Protocols.
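
The hybrid placement rule described above can be illustrated with a minimal sketch. The `Job` fields, the `place_job` function, and the three outcomes are hypothetical names chosen for illustration; the actual scheduling policy is not documented here and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    gpus: int        # number of GPUs the job requests
    sensitive: bool  # True if the job touches data that must stay on-premise


def place_job(job: Job, free_onprem_gpus: int) -> str:
    """Return 'on-premise', 'cloud', or 'queue' (illustrative policy only)."""
    if job.gpus <= free_onprem_gpus:
        # Prefer local capacity whenever it is available.
        return "on-premise"
    if job.sensitive:
        # Sensitive data never leaves the data centre, so the job waits.
        return "queue"
    # Non-sensitive work may burst to the cloud during peak demand.
    return "cloud"
```

Under this sketch, a sensitive job that does not fit locally is queued rather than sent to the cloud, which mirrors the split between on-premise and burst capacity.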

Hardware Configuration

The server cluster consists of the following primary hardware components, with specifications for each node type given in the tables below. All servers are rack-mounted and use redundant power supplies and cooling systems for high availability. See also Power Redundancy.

Compute Nodes

These nodes handle the core AI processing tasks. Each is equipped with four NVIDIA A100 GPUs and 512GB of RAM.

| Component | Specification |
|---|---|
| CPU | Dual Intel Xeon Gold 6338 (32 Cores/64 Threads per CPU) |
| RAM | 512GB DDR4 ECC Registered 3200MHz |
| GPU | 4x NVIDIA A100 80GB PCIe 4.0 |
| Storage (Local) | 2TB NVMe PCIe 4.0 SSD (OS & Temp Data) |
| Network Interface | Dual 200Gbps InfiniBand |

We currently have 24 compute nodes, managed through Slurm Workload Manager. Regular hardware health checks are performed as outlined in Server Maintenance Schedule.
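
As a rough illustration, the per-node figures above can be multiplied out to give cluster-wide totals, and the same limits can be used to sanity-check a Slurm job request before submission. This is a sketch only: `cluster_totals` and `sbatch_header` are hypothetical helpers, and the GRES name `gpu` is an assumption about how the cluster's Slurm configuration exposes GPUs.

```python
NODES = 24
CPUS_PER_NODE = 2
CORES_PER_CPU = 32
GPUS_PER_NODE = 4
GPU_MEMORY_GB = 80
RAM_PER_NODE_GB = 512


def cluster_totals() -> dict:
    """Aggregate compute resources across all compute nodes."""
    return {
        "cores": NODES * CPUS_PER_NODE * CORES_PER_CPU,
        "gpus": NODES * GPUS_PER_NODE,
        "gpu_memory_gb": NODES * GPUS_PER_NODE * GPU_MEMORY_GB,
        "ram_gb": NODES * RAM_PER_NODE_GB,
    }


def sbatch_header(job_name: str, nodes: int, gpus_per_node: int) -> str:
    """Render sbatch directives after checking the request fits the cluster."""
    if not 1 <= nodes <= NODES:
        raise ValueError(f"nodes must be between 1 and {NODES}")
    if not 1 <= gpus_per_node <= GPUS_PER_NODE:
        raise ValueError(f"gpus_per_node must be between 1 and {GPUS_PER_NODE}")
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # Assumes GPUs are configured as a generic resource named "gpu".
        f"#SBATCH --gres=gpu:{gpus_per_node}",
    ])


if __name__ == "__main__":
    print(cluster_totals())
    print(sbatch_header("train-demo", nodes=2, gpus_per_node=4))
```

With the documented specifications this works out to 1,536 physical cores, 96 GPUs, 7,680GB of GPU memory, and 12,288GB of RAM across the cluster.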

Storage Servers

These servers provide persistent storage for datasets, model checkpoints, and other critical data.

| Component | Specification |
|---|---|
| CPU | Dual Intel Xeon Silver 4310 (12 Cores/24 Threads per CPU) |
| RAM | 256GB DDR4 ECC Registered 3200MHz |
| Storage | 16 x 18TB SAS 7.2K RPM HDDs in RAID 6 (288TB raw, approximately 200TB usable) |
| Network Interface | Dual 100Gbps Ethernet |
| File System | Ceph |

The storage servers use Ceph, a distributed storage system, for scalability and resilience. See Ceph Configuration Guide for more information. A dedicated backup system is detailed in Backup and Disaster Recovery.
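
The relationship between raw and usable capacity on a storage server can be checked with a short calculation. This sketch covers only RAID 6 parity and decimal-to-binary unit conversion; the documented figure of 200TB usable is lower than either result, presumably because of file system and Ceph overhead that is not broken down here.

```python
def raid6_capacity_tb(drives: int, drive_tb: float) -> float:
    """Capacity of a RAID 6 array: two drives' worth of space holds parity."""
    if drives < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (drives - 2) * drive_tb


def tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes (10**12 bytes) to tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


if __name__ == "__main__":
    raw_tb = 16 * 18                        # 288 TB raw
    after_parity = raid6_capacity_tb(16, 18)  # 252 TB after RAID 6 parity
    print(raw_tb, after_parity, round(tb_to_tib(after_parity), 1))
```

For 16 drives of 18TB each, this gives 288TB raw and 252TB after parity, which is roughly 229TiB when expressed in binary units.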

Network Infrastructure

The network infrastructure provides high-bandwidth, low-latency connectivity between the servers.

| Component | Specification |
|---|---|
| Core Switches | Arista 7050X Series |
| Interconnect | 400Gbps Fiber Optic |
| Network Topology | Clos Network |
| Firewall | Palo Alto Networks PA-820 |
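
To give a feel for what these link speeds mean in practice, the following sketch computes an idealised lower bound on transfer time from nominal line rate alone. It ignores protocol overhead, disk throughput, and contention, so real transfers will be slower; `min_transfer_seconds` is a hypothetical helper, not part of any cluster tooling.

```python
def min_transfer_seconds(size_tb: float, link_gbps: float, links: int = 1) -> float:
    """Best-case time to move size_tb terabytes over `links` parallel links."""
    if size_tb < 0 or link_gbps <= 0 or links < 1:
        raise ValueError("size must be non-negative; rate and link count positive")
    bits = size_tb * 1e12 * 8              # decimal terabytes to bits
    return bits / (link_gbps * 1e9 * links)


if __name__ == "__main__":
    # 1 TB leaving a storage server over its dual 100Gbps Ethernet links.
    print(min_transfer_seconds(1, 100, links=2))
    # 1 TB arriving at a compute node over its dual 200Gbps InfiniBand links.
    print(min_transfer_seconds(1, 200, links=2))
```

At nominal rates, a 1TB dataset needs at least 40 seconds to leave a single storage server and at least 20 seconds to arrive at a single compute node.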

Network security is paramount. Refer to Network Security Policy for detailed information.

Software Configuration

The "AI in Manchester" cluster runs a customized Linux distribution based on Ubuntu 22.04 LTS. The following software components are installed on each node.
