AI in Cambridge: Server Configuration
This document details the server configuration for the "AI in Cambridge" project, which provides computational resources for advanced machine learning research within the University of Cambridge. It offers a technical overview for system administrators and developers, is intended for newcomers to the wiki, and assumes a basic understanding of server administration. Please refer to Help:Contents for more general information on using this wiki.
Overview
The "AI in Cambridge" infrastructure consists of a cluster of high-performance servers dedicated to training and deploying artificial intelligence models. The core of the system revolves around GPU-accelerated computing, coupled with a high-bandwidth network and a robust storage solution. We utilize a hybrid cloud approach, leveraging both on-premise hardware and cloud resources from Amazon Web Services. Understanding the Network Topology is critical for troubleshooting. This setup allows for scalability and cost-effectiveness.
Hardware Specifications
The primary compute nodes are built to the following specification. Minor variations may exist between individual servers, but this represents the core configuration.
| Component | Specification |
|---|---|
| CPU | Dual Intel Xeon Gold 6338 (32 cores / 64 threads per CPU) |
| RAM | 512 GB DDR4 ECC Registered, 3200 MHz |
| GPU | 8x NVIDIA A100 80 GB |
| Storage (local) | 2x 4 TB NVMe SSD (RAID 1), OS & temporary data |
| Network interface | 2x 200 Gbps InfiniBand |
| Power supply | 3000 W redundant power supplies |
These specifications are detailed in the Hardware Inventory document. Regular System Monitoring is performed to ensure optimal performance.
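As a quick sanity check after provisioning, a node's visible hardware can be compared against the table above. The following is a minimal sketch, assuming PyTorch and psutil are available in the node's Python environment; the expected values in the comments are taken from the specification on this page.

```python
# Sanity-check that a compute node matches the core specification above.
# Assumes torch and psutil are installed; expected values come from this page.
import torch
import psutil

def check_node() -> None:
    gpus = torch.cuda.device_count()
    print(f"GPUs visible: {gpus} (expected 8x NVIDIA A100 80 GB)")
    for i in range(gpus):
        props = torch.cuda.get_device_properties(i)
        print(f"  GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")

    ram_gib = psutil.virtual_memory().total / 2**30
    print(f"RAM: {ram_gib:.0f} GiB (expected ~512 GB)")
    # Dual 32-core / 64-thread CPUs should expose 128 logical CPUs.
    print(f"Logical CPUs: {psutil.cpu_count(logical=True)} (expected 128)")

if __name__ == "__main__":
    check_node()
```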
Software Stack
The servers run a customized Linux distribution based on Ubuntu Server 22.04 LTS. The software stack is designed for ease of use and compatibility with popular machine learning frameworks.
| Software | Version |
|---|---|
| Operating System | Ubuntu Server 22.04 LTS |
| CUDA Toolkit | 12.2 |
| cuDNN | 8.9.2 |
| NVIDIA Driver | 535.104.05 |
| Python | 3.10 |
| TensorFlow | 2.13.0 |
| PyTorch | 2.0.1 |
| Horovod | 0.26.1 |
All software is managed using Ansible for automated deployment and configuration. Please refer to the Software Repository for detailed package lists. We also utilize Docker for containerization of applications.
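As a lightweight post-deployment check (for example, at the end of an Ansible run), the versions actually present on a node can be printed and compared against the table above. This is an illustrative sketch, assuming the frameworks listed are importable; it is not part of the managed deployment itself.

```python
# Print the versions actually installed on a node for comparison against
# the software stack table on this page. Targets in comments are from the
# table; the script only reports what it finds.
import sys
import tensorflow as tf
import torch

print(f"Python             : {sys.version.split()[0]}")   # target 3.10
print(f"TensorFlow         : {tf.__version__}")            # target 2.13.0
print(f"PyTorch            : {torch.__version__}")         # target 2.0.1
print(f"CUDA (torch build) : {torch.version.cuda}")
print(f"cuDNN (torch build): {torch.backends.cudnn.version()}")
print(f"GPU available      : {torch.cuda.is_available()}")
```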
Storage Configuration
A distributed file system provides high-performance, scalable storage for datasets and model checkpoints; Lustre is the primary file system.
| Component | Specification |
|---|---|
| File system | Lustre 2.12 |
| Storage nodes | 12x 16 TB SAS HDDs (RAID 6 per node) |
| Total storage capacity | ~120 TB usable |
| Network | 100 Gbps Ethernet |
| Metadata servers | Dual metadata servers (active/passive) |
Data backups are performed nightly to a separate offsite location, following the Backup Policy. Access to the storage system is controlled via User Authentication. Consider reviewing the Data Security protocols.
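Large model checkpoints benefit from being striped across multiple Lustre object storage targets (OSTs). The sketch below shells out to the standard `lfs setstripe` utility; the mount point, directory name, and stripe parameters are assumptions for illustration only, so consult the storage team for the values actually in use on this cluster.

```python
# Illustrative helper: apply Lustre striping to a directory before writing
# large checkpoint files into it. Path and stripe parameters below are
# hypothetical placeholders, not values from this page.
import subprocess

def stripe_directory(path: str, stripe_count: int = -1, stripe_size: str = "4M") -> None:
    """Set the Lustre stripe layout on a directory (-1 = stripe over all OSTs)."""
    subprocess.run(
        ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, path],
        check=True,
    )

# Example (hypothetical path on the cluster's Lustre mount):
# stripe_directory("/lustre/checkpoints/my_experiment", stripe_count=8)
```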
Networking
The servers are interconnected using a high-bandwidth, low-latency InfiniBand network. This network is critical for distributed training of large models. The network is segmented into separate VLANs for management, storage, and compute traffic. The Firewall Configuration is regularly reviewed for security vulnerabilities.
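To illustrate how this fabric is exercised, here is a minimal Horovod sketch for data-parallel PyTorch training across the cluster's GPUs. The model, data, and hyperparameters are placeholders, and how the job is launched (e.g. `horovodrun -np <workers> python train.py`) depends on the scheduler in use; treat this as a sketch of the pattern, not a prescribed training script.

```python
# Minimal Horovod + PyTorch data-parallel training sketch.
# Model, data, and learning rate below are placeholders.
import torch
import horovod.torch as hvd

hvd.init()                                 # one process per GPU
torch.cuda.set_device(hvd.local_rank())    # pin each process to its local GPU

model = torch.nn.Linear(1024, 10).cuda()   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all workers.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Broadcast initial state from rank 0 so all workers start identically.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")        # placeholder batch
    y = torch.randint(0, 10, (32,), device="cuda")  # placeholder labels
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```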
Security Considerations
Security is a paramount concern. The servers are protected by a multi-layered security approach, including firewalls, intrusion detection systems, and regular security audits. All access to the servers is controlled via SSH with key-based authentication. Please adhere to the Security Best Practices when working with the infrastructure. We also use Two-Factor Authentication for privileged accounts.
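For scripted access, connections should use key-based authentication rather than passwords, consistent with the policy above. The sketch below uses the third-party `paramiko` library; the hostname, username, and key path are hypothetical placeholders, not real cluster addresses.

```python
# Hedged example of key-based SSH access from a script using paramiko.
# Hostname, username, and key path are hypothetical placeholders.
import paramiko

client = paramiko.SSHClient()
# Require known host keys rather than silently trusting unknown servers.
client.load_system_host_keys()
client.set_missing_host_key_policy(paramiko.RejectPolicy())

client.connect(
    hostname="compute-01.example.cam.ac.uk",          # placeholder
    username="researcher",                            # placeholder
    key_filename="/home/researcher/.ssh/id_ed25519",  # placeholder
)

stdin, stdout, stderr = client.exec_command(
    "nvidia-smi --query-gpu=name --format=csv"
)
print(stdout.read().decode())
client.close()
```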
Future Expansion
We plan to expand the cluster in the near future with additional GPU servers and increased storage capacity. This expansion will be documented in the Future Development section.