AI in Cambridge: Server Configuration
This document details the server configuration for the "AI in Cambridge" project, which provides computational resources for advanced machine learning research within the University of Cambridge. It offers a technical overview for system administrators and developers, is intended for newcomers to the wiki, and assumes a basic understanding of server administration. Please refer to Help:Contents for more general information on using this wiki.
Overview
The "AI in Cambridge" infrastructure consists of a cluster of high-performance servers dedicated to training and deploying artificial intelligence models. The core of the system revolves around GPU-accelerated computing, coupled with a high-bandwidth network and a robust storage solution. We utilize a hybrid cloud approach, leveraging both on-premise hardware and cloud resources from Amazon Web Services. Understanding the Network Topology is critical for troubleshooting. This setup allows for scalability and cost-effectiveness.
Hardware Specifications
The primary compute nodes are built to the following specification. Minor variations may exist between individual servers, but this represents the core configuration.
| Component | Specification |
|---|---|
| CPU | Dual Intel Xeon Gold 6338 (32 cores / 64 threads per CPU) |
| RAM | 512 GB DDR4 ECC Registered, 3200 MHz |
| GPU | 8x NVIDIA A100 80 GB |
| Storage (local) | 2x 4 TB NVMe SSD (RAID 1), OS & temporary data |
| Network interface | 2x 200 Gbps InfiniBand |
| Power supply | 3000 W redundant power supplies |
These specifications are detailed in the Hardware Inventory document. Regular System Monitoring is performed to ensure optimal performance.
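As a quick sanity check after provisioning, a node's visible hardware can be compared against the table above. The following is a minimal sketch, assuming PyTorch and psutil are available in the node's Python environment; the expected values in the comments are taken from the specification on this page.

```python
# Sanity-check that a compute node matches the core specification above.
# Assumes torch and psutil are installed; expected values come from this page.
import torch
import psutil

def check_node() -> None:
    gpus = torch.cuda.device_count()
    print(f"GPUs visible: {gpus} (expected 8x NVIDIA A100 80 GB)")
    for i in range(gpus):
        props = torch.cuda.get_device_properties(i)
        print(f"  GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")

    ram_gib = psutil.virtual_memory().total / 2**30
    print(f"RAM: {ram_gib:.0f} GiB (expected ~512 GB)")
    # Dual 32-core / 64-thread CPUs should expose 128 logical CPUs.
    print(f"Logical CPUs: {psutil.cpu_count(logical=True)} (expected 128)")

if __name__ == "__main__":
    check_node()
```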
Software Stack
The servers run a customized Linux distribution based on Ubuntu Server 22.04 LTS. The software stack is designed for ease of use and compatibility with popular machine learning frameworks.
| Software | Version |
|---|---|
| Operating System | Ubuntu Server 22.04 LTS |
| CUDA Toolkit | 12.2 |
| cuDNN | 8.9.2 |
| NVIDIA Driver | 535.104.05 |
| Python | 3.10 |
| TensorFlow | 2.13.0 |
| PyTorch | 2.0.1 |
| Horovod | 0.26.1 |
All software is managed using Ansible for automated deployment and configuration. Please refer to the Software Repository for detailed package lists. We also utilize Docker for containerization of applications.
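As a lightweight post-deployment check (for example, at the end of an Ansible run), the versions actually present on a node can be printed and compared against the table above. This is an illustrative sketch, assuming the frameworks listed are importable; it is not part of the managed deployment itself.

```python
# Print the versions actually installed on a node for comparison against
# the software stack table on this page. Targets in comments are from the
# table; the script only reports what it finds.
import sys
import tensorflow as tf
import torch

print(f"Python             : {sys.version.split()[0]}")   # target 3.10
print(f"TensorFlow         : {tf.__version__}")            # target 2.13.0
print(f"PyTorch            : {torch.__version__}")         # target 2.0.1
print(f"CUDA (torch build) : {torch.version.cuda}")
print(f"cuDNN (torch build): {torch.backends.cudnn.version()}")
print(f"GPU available      : {torch.cuda.is_available()}")
```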
Storage Configuration
A distributed file system provides high-performance, scalable storage for datasets and model checkpoints; Lustre is the primary file system.
| Component | Specification |
|---|---|
| File system | Lustre 2.12 |
| Storage nodes | 12x 16 TB SAS HDDs (RAID 6 per node) |
| Total storage capacity | ~120 TB usable |
| Network | 100 Gbps Ethernet |
| Metadata servers | Dual metadata servers (active/passive) |
Data backups are performed nightly to a separate offsite location, following the Backup Policy. Access to the storage system is controlled via User Authentication. Consider reviewing the Data Security protocols.
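Large model checkpoints benefit from being striped across multiple Lustre object storage targets (OSTs). The sketch below shells out to the standard `lfs setstripe` utility; the mount point, directory name, and stripe parameters are assumptions for illustration only, so consult the storage team for the values actually in use on this cluster.

```python
# Illustrative helper: apply Lustre striping to a directory before writing
# large checkpoint files into it. Path and stripe parameters below are
# hypothetical placeholders, not values from this page.
import subprocess

def stripe_directory(path: str, stripe_count: int = -1, stripe_size: str = "4M") -> None:
    """Set the Lustre stripe layout on a directory (-1 = stripe over all OSTs)."""
    subprocess.run(
        ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, path],
        check=True,
    )

# Example (hypothetical path on the cluster's Lustre mount):
# stripe_directory("/lustre/checkpoints/my_experiment", stripe_count=8)
```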
Networking
The servers are interconnected using a high-bandwidth, low-latency InfiniBand network. This network is critical for distributed training of large models. The network is segmented into separate VLANs for management, storage, and compute traffic. The Firewall Configuration is regularly reviewed for security vulnerabilities.
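To illustrate how this fabric is exercised, here is a minimal Horovod sketch for data-parallel PyTorch training across the cluster's GPUs. The model, data, and hyperparameters are placeholders, and how the job is launched (e.g. `horovodrun -np <workers> python train.py`) depends on the scheduler in use; treat this as a sketch of the pattern, not a prescribed training script.

```python
# Minimal Horovod + PyTorch data-parallel training sketch.
# Model, data, and learning rate below are placeholders.
import torch
import horovod.torch as hvd

hvd.init()                                 # one process per GPU
torch.cuda.set_device(hvd.local_rank())    # pin each process to its local GPU

model = torch.nn.Linear(1024, 10).cuda()   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all workers.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Broadcast initial state from rank 0 so all workers start identically.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")        # placeholder batch
    y = torch.randint(0, 10, (32,), device="cuda")  # placeholder labels
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```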
Security Considerations
Security is a paramount concern. The servers are protected by a multi-layered security approach, including firewalls, intrusion detection systems, and regular security audits. All access to the servers is controlled via SSH with key-based authentication. Please adhere to the Security Best Practices when working with the infrastructure. We also use Two-Factor Authentication for privileged accounts.
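For scripted access, connections should use key-based authentication rather than passwords, consistent with the policy above. The sketch below uses the third-party `paramiko` library; the hostname, username, and key path are hypothetical placeholders, not real cluster addresses.

```python
# Hedged example of key-based SSH access from a script using paramiko.
# Hostname, username, and key path are hypothetical placeholders.
import paramiko

client = paramiko.SSHClient()
# Require known host keys rather than silently trusting unknown servers.
client.load_system_host_keys()
client.set_missing_host_key_policy(paramiko.RejectPolicy())

client.connect(
    hostname="compute-01.example.cam.ac.uk",          # placeholder
    username="researcher",                            # placeholder
    key_filename="/home/researcher/.ssh/id_ed25519",  # placeholder
)

stdin, stdout, stderr = client.exec_command(
    "nvidia-smi --query-gpu=name --format=csv"
)
print(stdout.read().decode())
client.close()
```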
Future Expansion
We plan to expand the cluster in the near future with additional GPU servers and increased storage capacity. This expansion will be documented in the Future Development section.