AI in Cambridge

From Server rental store
Revision as of 04:53, 16 April 2025 by Admin (talk | contribs) (Automated server configuration article)
AI in Cambridge: Server Configuration

This document details the server configuration for the "AI in Cambridge" project, providing a technical overview for system administrators and developers. This project focuses on providing computational resources for advanced machine learning research within the University of Cambridge. This guide is intended for newcomers to the wiki and assumes a basic understanding of server administration. Please refer to Help:Contents for more general information on using this wiki.

Overview

The "AI in Cambridge" infrastructure consists of a cluster of high-performance servers dedicated to training and deploying artificial intelligence models. The core of the system revolves around GPU-accelerated computing, coupled with a high-bandwidth network and a robust storage solution. We utilize a hybrid cloud approach, leveraging both on-premise hardware and cloud resources from Amazon Web Services. Understanding the Network Topology is critical for troubleshooting. This setup allows for scalability and cost-effectiveness.

Hardware Specifications

The primary compute nodes are built around the following specifications. Note that minor variations may exist between individual servers, but the following represents the core configuration.

Component | Specification
CPU | Dual Intel Xeon Gold 6338 (32 cores / 64 threads per CPU)
RAM | 512 GB DDR4 ECC Registered, 3200 MHz
GPU | 8x NVIDIA A100 80 GB
Storage (local) | 2x 4 TB NVMe SSD (RAID 1), OS and temporary data
Network interface | 2x 200 Gbps InfiniBand
Power supply | 3000 W redundant power supplies

These specifications are detailed in the Hardware Inventory document. Regular System Monitoring is performed to ensure optimal performance.
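
As an illustrative sketch of what routine monitoring can look like on these nodes (not the project's actual monitoring tooling), the helper below flags overheating GPUs from nvidia-smi's CSV output. The function name and the 85 C threshold are assumptions, not documented values:

```shell
#!/bin/sh
# Illustrative monitoring helper: flag GPUs whose temperature exceeds a
# threshold, reading "index, temperature" CSV lines such as those from:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader
# The 85 C threshold below is an assumption, not a project-mandated value.
hot_gpus() {
    awk -F', *' -v t="$1" '$2 > t { print "GPU " $1 ": " $2 " C" }'
}

# Example with canned output; on a compute node, pipe nvidia-smi instead:
printf '0, 41\n1, 88\n2, 63\n' | hot_gpus 85
```

On an 8-GPU node the same pattern extends naturally to per-GPU utilization or memory queries via additional `--query-gpu` fields.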

Software Stack

The servers run a customized Linux distribution based on Ubuntu Server 22.04 LTS. The software stack is designed for ease of use and compatibility with popular machine learning frameworks.

Software | Version
Operating system | Ubuntu Server 22.04 LTS
CUDA Toolkit | 12.2
cuDNN | 8.9.2
NVIDIA driver | 535.104.05
Python | 3.10
TensorFlow | 2.13.0
PyTorch | 2.0.1
Horovod | 0.26.1

All software is managed using Ansible for automated deployment and configuration. Please refer to the Software Repository for detailed package lists. We also utilize Docker for containerization of applications.
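
Since the stack is pinned to specific versions, a simple post-deployment check can confirm a node matches the table above. The sketch below is an assumption about how such a check might look; the real deployment is driven by Ansible roles that are not shown in this article:

```shell
#!/bin/sh
# Illustrative post-deployment check: compare installed Python package
# versions against the pinned versions from the software-stack table.
# The helper name and "pip show" parsing are assumptions, not taken from
# the project's Ansible roles.
expect_version() {
    pkg="$1"; want="$2"
    got="$(python3 -m pip show "$pkg" 2>/dev/null | awk '/^Version:/ { print $2 }')"
    if [ "$got" = "$want" ]; then
        echo "OK   $pkg $got"
    else
        echo "FAIL $pkg: want $want, got ${got:-missing}"
    fi
}

expect_version tensorflow 2.13.0
expect_version torch 2.0.1
expect_version horovod 0.26.1
```

A check like this can run as an Ansible post-task or a cron job, so that drift from the pinned versions is caught before it breaks a training run.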

Storage Configuration

A distributed file system is employed to provide high-performance, scalable storage for datasets and model checkpoints. The system uses Lustre as the primary file system.

Component | Specification
File system | Lustre 2.12.120
Storage nodes | 12x 16 TB SAS HDDs per node (RAID 6)
Total storage capacity | ~120 TB usable
Network | 100 Gbps Ethernet
Metadata servers | Dual metadata servers (active/passive)

Data backups are performed nightly to a separate offsite location, following the Backup Policy. Access to the storage system is controlled via User Authentication. Consider reviewing the Data Security protocols.
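
For orientation, a Lustre client entry on a compute node might look like the following. This is an illustrative fragment only: the metadata server hostnames (mds01/mds02) and the file system name "aifs" are placeholders, not names documented in this article; the failover pair mirrors the active/passive metadata servers, and the LNet "tcp" network matches the 100 Gbps storage Ethernet above.

```
# /etc/fstab (illustrative excerpt) - hostnames and fs name are placeholders
mds01@tcp:mds02@tcp:/aifs  /mnt/aifs  lustre  defaults,_netdev  0 0
```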


Networking

The servers are interconnected using a high-bandwidth, low-latency InfiniBand network. This network is critical for distributed training of large models. The network is segmented into separate VLANs for management, storage, and compute traffic. The Firewall Configuration is regularly reviewed for security vulnerabilities.
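
Before launching a distributed training job it is worth confirming that both ports of the dual-port InfiniBand NIC are up. The sketch below parses `ibstat` output, which prints one "State: Active" line per active port; the helper name and this check are illustrative, not part of a documented procedure:

```shell
#!/bin/sh
# Illustrative pre-flight check: count InfiniBand ports reporting
# "State: Active" in ibstat output. Both ports of the dual-port NIC
# should be up before a distributed run.
active_ports() {
    grep -c 'State: Active'
}

# On a compute node: ibstat | active_ports   (expect 2)
printf 'State: Active\nState: Down\nState: Active\n' | active_ports
```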

Security Considerations

Security is a paramount concern. The servers are protected by a multi-layered security approach, including firewalls, intrusion detection systems, and regular security audits. All access to the servers is controlled via SSH with key-based authentication. Please adhere to the Security Best Practices when working with the infrastructure. We also use Two-Factor Authentication for privileged accounts.
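
As a sketch, an sshd_config enforcing the key-only, no-root policy described above might include the following directives (an illustrative excerpt, not the project's actual configuration file):

```
# /etc/ssh/sshd_config (illustrative excerpt)
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
```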

Future Expansion

We plan to expand the cluster in the near future with additional GPU servers and increased storage capacity. This expansion will be documented in the Future Development section.





Intel-Based Server Configurations

Configuration | Specifications | Benchmark
Core i7-6700K/7700 Server | 64 GB DDR4, NVMe SSD 2x 512 GB | CPU Benchmark: 8046
Core i7-8700 Server | 64 GB DDR4, NVMe SSD 2x 1 TB | CPU Benchmark: 13124
Core i9-9900K Server | 128 GB DDR4, NVMe SSD 2x 1 TB | CPU Benchmark: 49969
Core i9-13900 Server (64 GB) | 64 GB RAM, 2x 2 TB NVMe SSD |
Core i9-13900 Server (128 GB) | 128 GB RAM, 2x 2 TB NVMe SSD |
Core i5-13500 Server (64 GB) | 64 GB RAM, 2x 500 GB NVMe SSD |
Core i5-13500 Server (128 GB) | 128 GB RAM, 2x 500 GB NVMe SSD |
Core i5-13500 Workstation | 64 GB DDR5 RAM, 2x NVMe SSD, NVIDIA RTX 4000 |

AMD-Based Server Configurations

Configuration | Specifications | Benchmark
Ryzen 5 3600 Server | 64 GB RAM, 2x 480 GB NVMe | CPU Benchmark: 17849
Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2x 1 TB NVMe | CPU Benchmark: 35224
Ryzen 9 5950X Server | 128 GB RAM, 2x 4 TB NVMe | CPU Benchmark: 46045
Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2x 2 TB NVMe | CPU Benchmark: 63561
EPYC 7502P Server (128 GB/1 TB) | 128 GB RAM, 1 TB NVMe | CPU Benchmark: 48021
EPYC 7502P Server (128 GB/2 TB) | 128 GB RAM, 2 TB NVMe | CPU Benchmark: 48021
EPYC 7502P Server (128 GB/4 TB) | 128 GB RAM, 2x 2 TB NVMe | CPU Benchmark: 48021
EPYC 7502P Server (256 GB/1 TB) | 256 GB RAM, 1 TB NVMe | CPU Benchmark: 48021
EPYC 7502P Server (256 GB/4 TB) | 256 GB RAM, 2x 2 TB NVMe | CPU Benchmark: 48021
EPYC 9454P Server | 256 GB RAM, 2x 2 TB NVMe |

Note: All benchmark scores are approximate and may vary based on configuration. Server availability is subject to stock.