AI in Cardiff
AI in Cardiff: Server Configuration
This document details the server configuration supporting the "AI in Cardiff" project. It is intended for new system administrators and developers working with this infrastructure. The project uses a cluster of servers to deliver machine learning services and data analysis capabilities. Familiarity with the hardware and software described here is needed for maintenance, troubleshooting, and future scaling. Review the System Administration Guide before making any changes.
Overview
The "AI in Cardiff" project relies on a hybrid server infrastructure, utilizing both physical servers and virtual machines. The physical servers handle computationally intensive tasks, while virtual machines provide flexibility for development, testing, and less demanding workloads. All servers are located within the secure data center at Cardiff University. See the Data Center Access Policy for details. The network topology is documented in the Network Diagram.
Physical Server Specifications
The core of the AI processing power comes from three dedicated physical servers, named 'Alys', 'Rhys', and 'Idris'. These servers are optimized for GPU-accelerated computing.
Server Name | CPU | RAM | GPU | Storage |
---|---|---|---|---|
Alys | 2 x Intel Xeon Gold 6248R (24 cores/48 threads per CPU) | 512 GB DDR4 ECC REG | 4 x NVIDIA A100 (80GB) | 8 x 4TB NVMe SSD (RAID 0) |
Rhys | 2 x Intel Xeon Gold 6248R (24 cores/48 threads per CPU) | 512 GB DDR4 ECC REG | 4 x NVIDIA A100 (80GB) | 8 x 4TB NVMe SSD (RAID 0) |
Idris | 2 x AMD EPYC 7763 (64 cores/128 threads per CPU) | 1 TB DDR4 ECC REG | 8 x NVIDIA A100 (80GB) | 8 x 8TB NVMe SSD (RAID 0) |
These servers run a custom-built Linux distribution based on Ubuntu Server 22.04. The servers are connected via a 100 Gbps InfiniBand network. Refer to the InfiniBand Configuration Guide for network details. Monitoring is handled by Prometheus and Grafana.
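For capacity planning it can help to see the table above as aggregate figures. The following is a minimal sketch that encodes the three servers and sums their resources; the class and field names are illustrative and not part of any project tooling, and 1 TB of RAM is assumed to mean 1024 GB.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PhysicalServer:
    """One row of the physical server table (field names are hypothetical)."""
    name: str
    cpu_sockets: int
    cores_per_cpu: int
    ram_gb: int
    gpus: int
    gpu_memory_gb: int
    drives: int
    drive_tb: int

    @property
    def cpu_cores(self) -> int:
        return self.cpu_sockets * self.cores_per_cpu

    @property
    def raw_storage_tb(self) -> int:
        # RAID 0 stripes without redundancy, so usable capacity equals raw capacity.
        return self.drives * self.drive_tb


# Values copied from the table; 1 TB RAM is taken as 1024 GB (assumption).
SERVERS = [
    PhysicalServer("Alys", 2, 24, 512, 4, 80, 8, 4),
    PhysicalServer("Rhys", 2, 24, 512, 4, 80, 8, 4),
    PhysicalServer("Idris", 2, 64, 1024, 8, 80, 8, 8),
]


def cluster_totals(servers: List[PhysicalServer]) -> Dict[str, int]:
    """Sum the headline resources across all physical servers."""
    return {
        "cpu_cores": sum(s.cpu_cores for s in servers),
        "ram_gb": sum(s.ram_gb for s in servers),
        "gpus": sum(s.gpus for s in servers),
        "gpu_memory_gb": sum(s.gpus * s.gpu_memory_gb for s in servers),
        "raw_storage_tb": sum(s.raw_storage_tb for s in servers),
    }


for key, value in cluster_totals(SERVERS).items():
    print(f"{key}: {value}")
```

Because every array is RAID 0, a single drive failure loses that server's whole array, so the storage total should be read as scratch capacity rather than protected capacity.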
Virtual Machine Configuration
A cluster of virtual machines (VMs) is managed using Proxmox VE. These VMs are used for a variety of tasks, including development, testing, and data pre-processing.
VM Name | CPU | RAM | Storage | Operating System |
---|---|---|---|---|
dev-1 | 8 vCPUs | 32 GB | 500 GB SSD | Ubuntu 22.04 |
test-1 | 4 vCPUs | 16 GB | 250 GB SSD | Ubuntu 22.04 |
data-prep-1 | 16 vCPUs | 64 GB | 1 TB SSD | CentOS 7 |
model-serving-1 | 8 vCPUs | 32 GB | 500 GB SSD | Ubuntu 22.04 |
Each VM has access to the same InfiniBand network as the physical servers, allowing for high-speed data transfer. VMs are backed up daily, following the procedure in the Backup and Recovery Plan.
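When checking VM allocations against host capacity, the table rows can be parsed directly instead of being retyped. This is a rough sketch assuming the rows are available as pipe-delimited text and that 1 TB is counted as 1000 GB; the function names are hypothetical and nothing here talks to Proxmox VE.

```python
import re
from collections import Counter
from typing import Dict, List

# Rows copied from the VM table in this document.
VM_TABLE = """\
dev-1 | 8 vCPUs | 32 GB | 500 GB SSD | Ubuntu 22.04 |
test-1 | 4 vCPUs | 16 GB | 250 GB SSD | Ubuntu 22.04 |
data-prep-1 | 16 vCPUs | 64 GB | 1 TB SSD | CentOS 7 |
model-serving-1 | 8 vCPUs | 32 GB | 500 GB SSD | Ubuntu 22.04 |
"""


def to_gb(text: str) -> int:
    """Convert a size such as '500 GB SSD' or '1TB' to whole gigabytes."""
    match = re.search(r"(\d+)\s*(GB|TB)", text)
    if match is None:
        raise ValueError(f"no size found in {text!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * 1000 if unit == "TB" else value  # decimal TB (assumption)


def parse_vm_table(table: str) -> List[Dict[str, object]]:
    """Turn each pipe-delimited row into a dictionary of typed values."""
    vms = []
    for line in table.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        name, cpu, ram, storage, os_name = cells
        vms.append({
            "name": name,
            "vcpus": int(cpu.split()[0]),
            "ram_gb": to_gb(ram),
            "storage_gb": to_gb(storage),
            "os": os_name,
        })
    return vms


vms = parse_vm_table(VM_TABLE)
print("vCPUs:", sum(vm["vcpus"] for vm in vms))
print("RAM (GB):", sum(vm["ram_gb"] for vm in vms))
print("Storage (GB):", sum(vm["storage_gb"] for vm in vms))
print("By OS:", dict(Counter(vm["os"] for vm in vms)))
```

Grouping by operating system makes it easy to spot outliers, such as the single CentOS 7 guest among the Ubuntu 22.04 VMs.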
Software Stack
The following software is deployed across these servers:
- Python 3.9: Used for all machine learning and data analysis tasks.
- TensorFlow 2.8: The primary machine learning framework.
- PyTorch 1.12: An alternative machine learning framework.
- CUDA Toolkit 11.7: For GPU acceleration.
- Docker: Used for containerizing applications and ensuring reproducibility. See the Docker Best Practices document.
- Kubernetes: Orchestrates the deployment and scaling of containerized applications.
- PostgreSQL: Used for data storage and management. The database schema is documented in the Database Documentation.
- Jupyter Notebook: Used for interactive data analysis and model development.
- Git: Version control system for all code. All code is stored in the Project Repository.
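A quick way to confirm that a host or container matches the versions listed above is to compare installed Python packages against the expected ones. The sketch below uses only the standard library; `EXPECTED`, `check_stack`, and the other names are illustrative, and the check covers Python packages only (the CUDA Toolkit, Docker, and PostgreSQL are not Python distributions and would need separate checks).

```python
import sys
from importlib import metadata
from typing import Callable, Dict, Optional

# PyPI distribution names mapped to the version prefixes listed in this document.
EXPECTED = {"tensorflow": "2.8", "torch": "1.12"}
EXPECTED_PYTHON = (3, 9)


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches(expected: str, actual: str) -> bool:
    """True if actual equals expected or is a patch release of it (2.8 -> 2.8.4)."""
    return actual == expected or actual.startswith(expected + ".")


def check_stack(
    expected: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, str]:
    """Report 'ok', 'missing', or a mismatch message for each package."""
    report = {}
    for name, want in expected.items():
        got = lookup(name)
        if got is None:
            report[name] = "missing"
        elif matches(want, got):
            report[name] = "ok"
        else:
            report[name] = f"mismatch (found {got}, expected {want}.x)"
    return report


python_ok = sys.version_info[:2] == EXPECTED_PYTHON
print(f"python {sys.version_info.major}.{sys.version_info.minor}:",
      "ok" if python_ok else "mismatch")
for package, status in check_stack(EXPECTED).items():
    print(f"{package}: {status}")
```

The `lookup` parameter exists so the comparison logic can be exercised without the frameworks installed, for example by passing a function that returns fixed version strings.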
Security Considerations
All servers are protected by a firewall and an intrusion detection system, and access is restricted to authorized personnel. Regular security audits are conducted, as outlined in the Security Policy. All data is encrypted at rest and in transit. See the Encryption Protocol document for details.
Future Expansion
The "AI in Cardiff" project is expected to grow significantly in the coming years. Plans are underway to add further physical servers and virtual machines. The next phase of expansion will focus on increasing GPU and storage capacity. The Capacity Planning Document outlines the details of this expansion.