AI in Leeds: Server Configuration
Welcome to the documentation for the "AI in Leeds" server cluster. This article details the hardware and software configuration powering our Artificial Intelligence initiatives within the Leeds data centre. This guide is aimed at newcomers to the wiki and those needing detailed information about the server infrastructure. Understanding this configuration is crucial for System Administrators and Developers working with AI models hosted on this cluster.
Overview
The "AI in Leeds" cluster is a dedicated environment designed to handle the computational demands of machine learning, deep learning, and natural language processing tasks. It comprises a network of high-performance servers interconnected via a low-latency network. The primary goal of this infrastructure is to provide a scalable and reliable platform for research and development in AI. We leverage Red Hat Enterprise Linux as our primary operating system due to its stability and security features. Network configuration is handled centrally, ensuring consistent performance.
Hardware Specifications
The cluster consists of three primary node types: Master Nodes, Compute Nodes, and Storage Nodes. Each node type is configured with specific hardware to optimize its role within the cluster.
Node Type | CPU | Memory | Storage | Network Interface |
---|---|---|---|---|
Master Nodes (2) | 2 x Intel Xeon Gold 6338 | 256 GB DDR4 ECC | 2 x 1 TB NVMe SSD (RAID 1) | 100 Gbps Ethernet |
Compute Nodes (10) | 2 x AMD EPYC 7763 | 512 GB DDR4 ECC | 4 x 4 TB NVMe SSD (RAID 0) | 200 Gbps InfiniBand |
Storage Nodes (3) | 2 x Intel Xeon Silver 4310 | 128 GB DDR4 ECC | 16 x 16 TB SAS HDD (RAID 6) | 100 Gbps Ethernet |
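The usable capacity implied by each RAID level in the table can be sanity-checked with a short calculation. This is a sketch using the drive counts and sizes listed above; the function and its name are illustrative, not part of any cluster tooling:

```python
def usable_tb(drives: int, drive_tb: float, raid: str) -> float:
    """Usable capacity in TB for the RAID levels used in this cluster."""
    if raid == "RAID 1":  # mirrored pair: half the raw capacity survives
        return drives * drive_tb / 2
    if raid == "RAID 0":  # striped: full raw capacity, no redundancy
        return drives * drive_tb
    if raid == "RAID 6":  # double parity: two drives' worth reserved
        return (drives - 2) * drive_tb
    raise ValueError(f"unsupported RAID level: {raid}")

# Per-node usable storage, from the hardware table above
master  = usable_tb(2, 1, "RAID 1")    # 2 x 1 TB mirrored  -> 1.0 TB
compute = usable_tb(4, 4, "RAID 0")    # 4 x 4 TB striped   -> 16.0 TB
storage = usable_tb(16, 16, "RAID 6")  # 16 x 16 TB, 2 parity -> 224.0 TB
```

Note the trade-off visible here: Compute Nodes use RAID 0 for maximum throughput and capacity at the cost of redundancy, while Storage Nodes sacrifice two drives of capacity for double-parity fault tolerance.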
Software Stack
The software stack is designed to provide a robust and flexible environment for AI development. We utilize a combination of open-source tools and proprietary software. Containerization with Docker and Kubernetes is central to our deployment strategy.
Component | Version | Purpose |
---|---|---|
Operating System | Red Hat Enterprise Linux 8.6 | Server Base |
Kubernetes | v1.24.3 | Container Orchestration |
Docker | 20.10.12 | Containerization (via cri-dockerd, since Kubernetes 1.24 removed the built-in dockershim) |
NVIDIA CUDA Toolkit | 11.7 | GPU Programming |
TensorFlow | 2.9.1 | Machine Learning Framework |
PyTorch | 1.12.1 | Deep Learning Framework |
JupyterHub | 3.0.0 | Interactive Computing Environment |
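The pinned versions above can be encoded as a manifest and compared against what a node actually reports, to catch version drift across the cluster. This is a minimal sketch; the `PINNED` dictionary mirrors the table, but the reported-version example is hypothetical:

```python
# Version pins taken from the software stack table above
PINNED = {
    "kubernetes": "1.24.3",
    "docker": "20.10.12",
    "cuda": "11.7",
    "tensorflow": "2.9.1",
    "pytorch": "1.12.1",
}

def check_versions(reported: dict) -> list:
    """Return a human-readable entry for every component that drifts from its pin."""
    return [
        f"{name}: expected {pin}, found {reported.get(name, 'missing')}"
        for name, pin in PINNED.items()
        if reported.get(name) != pin
    ]

# Hypothetical example: a node running an older PyTorch
drift = check_versions({**PINNED, "pytorch": "1.11.0"})
# drift == ["pytorch: expected 1.12.1, found 1.11.0"]
```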
Network Topology
The network is a critical component of the cluster, providing high-bandwidth, low-latency communication between nodes. The network is segmented into three subnets: one for the Master Nodes, one for the Compute Nodes, and one for the Storage Nodes. Firewall configuration is managed centrally to ensure security.
Subnet | IP Range | Nodes |
---|---|---|
Master | 192.168.1.0/24 | Master Node 1, Master Node 2 |
Compute | 192.168.2.0/24 | Compute Node 1 - 10 |
Storage | 192.168.3.0/24 | Storage Node 1 - 3 |
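The subnet layout above can be expressed directly with the standard-library `ipaddress` module, for example to classify which segment a given address belongs to. A sketch; the sample addresses are illustrative:

```python
import ipaddress

# Subnets from the network topology table above
SUBNETS = {
    "master":  ipaddress.ip_network("192.168.1.0/24"),
    "compute": ipaddress.ip_network("192.168.2.0/24"),
    "storage": ipaddress.ip_network("192.168.3.0/24"),
}

def classify(ip):
    """Return the name of the cluster subnet an address belongs to, or None."""
    addr = ipaddress.ip_address(ip)
    for name, net in SUBNETS.items():
        if addr in net:
            return name
    return None

# classify("192.168.2.7") == "compute"
# classify("10.0.0.1") is None  (outside all three cluster subnets)
```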
Security Considerations
Security is paramount. We employ multiple layers of security, including:
- Firewall rules to restrict network access.
- Regular security audits and vulnerability scans.
- Strong authentication and authorization mechanisms.
- Data encryption at rest and in transit.
- Data backup procedures.
- Intrusion detection systems monitor for malicious activity.
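As an illustration of the "strong authentication" point, salted password hashing with a memory-hard KDF can be sketched using only the standard library. This is not the cluster's actual authentication stack, just a minimal example of the mechanism:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Derive a salted key from a password using scrypt (a memory-hard KDF)."""
    salt = salt or os.urandom(16)
    key = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, key

def verify_password(password, salt, expected):
    """Re-derive the key and compare in constant time to resist timing attacks."""
    key = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(key, expected)

salt, key = hash_password("correct horse battery staple")
# verify_password("correct horse battery staple", salt, key) is True
# verify_password("wrong guess", salt, key) is False
```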
Monitoring and Alerting
The cluster is continuously monitored using Prometheus and Grafana. Alerts notify administrators of issues such as high CPU usage, memory exhaustion, or disk failures. Log analysis is performed with the ELK stack (Elasticsearch, Logstash, Kibana), and Nagios provides basic host-level monitoring.
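The alert conditions described above reduce to threshold checks over sampled metrics. The following sketch shows the shape of that logic; the threshold values and metric names are illustrative, not the cluster's actual Prometheus rules:

```python
# Illustrative thresholds (not the production Prometheus rules)
THRESHOLDS = {
    "cpu_percent": 90.0,        # alert when CPU usage exceeds this
    "memory_percent": 95.0,     # alert when memory usage exceeds this
    "disk_free_percent": 10.0,  # alert when free disk space drops BELOW this
}

def active_alerts(sample):
    """Return the alert names triggered by a single metrics sample."""
    alerts = []
    if sample["cpu_percent"] > THRESHOLDS["cpu_percent"]:
        alerts.append("HighCPU")
    if sample["memory_percent"] > THRESHOLDS["memory_percent"]:
        alerts.append("MemoryExhaustion")
    if sample["disk_free_percent"] < THRESHOLDS["disk_free_percent"]:
        alerts.append("LowDisk")
    return alerts

# active_alerts({"cpu_percent": 97.0, "memory_percent": 40.0,
#                "disk_free_percent": 55.0}) == ["HighCPU"]
```

In practice each rule would also carry a sustain duration (e.g. "for 5 minutes") so transient spikes do not page anyone.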
Future Enhancements
Planned upgrades include:
- Adding more GPU-accelerated Compute Nodes.
- Implementing a more advanced storage solution with NVMe-oF.
- Integrating with a cloud-based object storage service.
- Exploring the use of serverless computing for certain AI workloads.

Scalability testing will be performed following any hardware changes.
Cluster maintenance is scheduled monthly to ensure the ongoing stability and performance of the system. Please refer to the troubleshooting guide for common issues.
Intel-Based Server Configurations
Configuration | Specifications | Benchmark |
---|---|---|
Core i7-6700K/7700 Server | 64 GB DDR4, 2 x 512 GB NVMe SSD | CPU Benchmark: 8046 |
Core i7-8700 Server | 64 GB DDR4, 2 x 1 TB NVMe SSD | CPU Benchmark: 13124 |
Core i9-9900K Server | 128 GB DDR4, 2 x 1 TB NVMe SSD | CPU Benchmark: 49969 |
Core i9-13900 Server (64GB) | 64 GB RAM, 2 x 2 TB NVMe SSD | |
Core i9-13900 Server (128GB) | 128 GB RAM, 2 x 2 TB NVMe SSD | |
Core i5-13500 Server (64GB) | 64 GB RAM, 2 x 500 GB NVMe SSD | |
Core i5-13500 Server (128GB) | 128 GB RAM, 2 x 500 GB NVMe SSD | |
Core i5-13500 Workstation | 64 GB DDR5 RAM, 2 NVMe SSD, NVIDIA RTX 4000 | |
AMD-Based Server Configurations
Configuration | Specifications | Benchmark |
---|---|---|
Ryzen 5 3600 Server | 64 GB RAM, 2 x 480 GB NVMe | CPU Benchmark: 17849 |
Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2 x 1 TB NVMe | CPU Benchmark: 35224 |
Ryzen 9 5950X Server | 128 GB RAM, 2 x 4 TB NVMe | CPU Benchmark: 46045 |
Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2 x 2 TB NVMe | CPU Benchmark: 63561 |
EPYC 7502P Server (128GB/1TB) | 128 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (128GB/2TB) | 128 GB RAM, 2 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (128GB/4TB) | 128 GB RAM, 2 x 2 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (256GB/1TB) | 256 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (256GB/4TB) | 256 GB RAM, 2 x 2 TB NVMe | CPU Benchmark: 48021 |
EPYC 9454P Server | 256 GB RAM, 2 x 2 TB NVMe | |
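To compare the listed configurations, the benchmark column can be ranked directly, or normalized per GB of RAM as a rough density metric. A sketch over a subset of rows from the tables above, with scores copied verbatim:

```python
# (name, ram_gb, cpu_benchmark) -- rows taken from the configuration tables above
CONFIGS = [
    ("Core i7-8700 Server",   64, 13124),
    ("Ryzen 5 3600 Server",   64, 17849),
    ("Ryzen 7 7700 Server",   64, 35224),
    ("Ryzen 9 7950X Server", 128, 63561),
    ("EPYC 7502P Server",    128, 48021),
]

# Rank by raw benchmark score, highest first
by_score = sorted(CONFIGS, key=lambda c: c[2], reverse=True)
# by_score[0][0] == "Ryzen 9 7950X Server"

# Benchmark points per GB of RAM, a crude compute-density metric
density = {name: round(score / ram, 1) for name, ram, score in CONFIGS}
```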
⚠️ *Note: All benchmark scores are approximate and may vary based on configuration. Server availability is subject to stock.*