AI in Leeds: Server Configuration

Welcome to the documentation for the "AI in Leeds" server cluster. This article details the hardware and software configuration powering our Artificial Intelligence initiatives within the Leeds data centre. It is aimed at newcomers to the wiki and at anyone who needs detailed information about the server infrastructure; understanding this configuration is essential for system administrators and developers working with AI models hosted on the cluster.

Overview

The "AI in Leeds" cluster is a dedicated environment designed to handle the computational demands of machine learning, deep learning, and natural language processing tasks. It comprises a network of high-performance servers interconnected via a low-latency network. The primary goal of this infrastructure is to provide a scalable and reliable platform for research and development in AI. We leverage Red Hat Enterprise Linux as our primary operating system due to its stability and security features. Network configuration is handled centrally, ensuring consistent performance.

Hardware Specifications

The cluster consists of three primary node types: Master Nodes, Compute Nodes, and Storage Nodes. Each node type is configured with specific hardware to optimize its role within the cluster.

Node Type          | CPU                        | Memory          | Storage                     | Network Interface
Master Nodes (2)   | 2 x Intel Xeon Gold 6338   | 256 GB DDR4 ECC | 2 x 1 TB NVMe SSD (RAID 1)  | 100 Gbps Ethernet
Compute Nodes (10) | 2 x AMD EPYC 7763          | 512 GB DDR4 ECC | 4 x 4 TB NVMe SSD (RAID 0)  | 200 Gbps InfiniBand
Storage Nodes (3)  | 2 x Intel Xeon Silver 4310 | 128 GB DDR4 ECC | 16 x 16 TB SAS HDD (RAID 6) | 100 Gbps Ethernet
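As a rough sanity check on the storage column, the sketch below (Python, purely illustrative and not part of the cluster tooling) computes approximate usable capacity per node type from the RAID levels listed above: RAID 1 mirrors, RAID 0 stripes across all disks, and RAID 6 loses two disks' worth of capacity to parity.

    # Illustrative helper: approximate usable capacity per node type based on
    # the RAID levels in the hardware table. Not part of the cluster tooling.

    def usable_tb(disks: int, size_tb: float, raid: str) -> float:
        """Approximate usable capacity in TB for a simple RAID layout."""
        if raid == "RAID 0":          # striping: all disks contribute
            return disks * size_tb
        if raid == "RAID 1":          # mirroring: half the raw capacity
            return disks * size_tb / 2
        if raid == "RAID 6":          # double parity: two disks' worth lost
            return (disks - 2) * size_tb
        raise ValueError(f"unsupported RAID level: {raid}")

    nodes = {
        "Master Node":  (2, 1, "RAID 1"),    # 2 x 1 TB NVMe SSD
        "Compute Node": (4, 4, "RAID 0"),    # 4 x 4 TB NVMe SSD
        "Storage Node": (16, 16, "RAID 6"),  # 16 x 16 TB SAS HDD
    }

    for name, (disks, size, raid) in nodes.items():
        print(f"{name}: ~{usable_tb(disks, size, raid):.0f} TB usable ({raid})")

This gives roughly 1 TB per Master Node, 16 TB per Compute Node, and 224 TB per Storage Node, before filesystem overhead.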

Software Stack

The software stack is designed to provide a robust and flexible environment for AI development. We utilize a combination of open-source tools and proprietary software. Containerization with Docker and Kubernetes is central to our deployment strategy.

Component           | Version                      | Purpose
Operating System    | Red Hat Enterprise Linux 8.6 | Server Base
Kubernetes          | v1.24.3                      | Container Orchestration
Docker              | 20.10.12                     | Containerization
NVIDIA CUDA Toolkit | 11.7                         | GPU Programming
TensorFlow          | 2.9.1                        | Machine Learning Framework
PyTorch             | 1.12.1                       | Deep Learning Framework
JupyterHub          | 3.0.0                        | Interactive Computing Environment
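Because containers are scheduled by Kubernetes onto Compute Nodes, a common first step after a deployment is to confirm that the CUDA toolkit and GPUs are actually visible to the frameworks in the table. The following is a minimal sketch of such a check, assuming the TensorFlow and PyTorch versions above are installed in the container image; it is illustrative rather than part of any standard deployment pipeline.

    # Illustrative check that the frameworks in the table can see the GPUs
    # exposed to a container. Run inside a pod scheduled on a Compute Node.
    import tensorflow as tf
    import torch

    print("TensorFlow", tf.__version__, "GPUs:", tf.config.list_physical_devices("GPU"))
    print("PyTorch", torch.__version__, "CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("CUDA device count:", torch.cuda.device_count())

If either framework reports no devices, the usual suspects are a missing NVIDIA device plugin on the node or a container image built without CUDA support.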

Network Topology

The network is a critical component of the cluster, providing high-bandwidth, low-latency communication between nodes. The network is segmented into three subnets: one for the Master Nodes, one for the Compute Nodes, and one for the Storage Nodes. Firewall configuration is managed centrally to ensure security.

Subnet  | IP Range       | Nodes
Master  | 192.168.1.0/24 | Master Node 1, Master Node 2
Compute | 192.168.2.0/24 | Compute Nodes 1-10
Storage | 192.168.3.0/24 | Storage Nodes 1-3
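Because the subnets follow a simple /24 scheme, it is straightforward to determine programmatically which segment a given address belongs to. The sketch below uses Python's standard ipaddress module with the ranges from the table; the example address is hypothetical.

    # Illustrative subnet lookup for the three cluster subnets listed above.
    import ipaddress

    SUBNETS = {
        "Master":  ipaddress.ip_network("192.168.1.0/24"),
        "Compute": ipaddress.ip_network("192.168.2.0/24"),
        "Storage": ipaddress.ip_network("192.168.3.0/24"),
    }

    def subnet_for(ip: str) -> str:
        """Return which cluster subnet an address belongs to."""
        addr = ipaddress.ip_address(ip)
        for name, net in SUBNETS.items():
            if addr in net:
                return name
        return "unknown"

    print(subnet_for("192.168.2.17"))   # hypothetical Compute Node address -> "Compute"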

Security Considerations

Security is paramount. We employ multiple layers of security, including:

  • Firewall rules to restrict network access (a quick port-check sketch follows this list).
  • Regular security audits and vulnerability scans.
  • Strong authentication and authorization mechanisms.
  • Data encryption at rest and in transit.
  • Data backup procedures.
  • Intrusion detection systems monitor for malicious activity.
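As an aid to auditing the firewall rules mentioned above, the following is a minimal sketch that probes a node for open TCP ports and flags anything unexpected. The hostname, port list, and expected-open set are hypothetical examples, not the cluster's actual policy.

    # Illustrative port audit against a single node.
    # Host, ports, and expected-open set are examples only.
    import socket

    EXPECTED_OPEN = {22, 6443}          # e.g. SSH and the Kubernetes API server

    def is_open(host: str, port: int, timeout: float = 1.0) -> bool:
        """Attempt a TCP connection; True if the port accepts connections."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    host = "192.168.1.10"               # hypothetical Master Node address
    for port in (22, 80, 443, 6443, 8080):
        open_now = is_open(host, port)
        state = "open" if open_now else "closed/filtered"
        note = "" if open_now == (port in EXPECTED_OPEN) else "  <-- review firewall rule"
        print(f"{host}:{port} {state}{note}")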

Monitoring and Alerting

The cluster is continuously monitored with Prometheus and Grafana. Alerts notify administrators of issues such as high CPU usage, memory exhaustion, or disk failures. Log analysis is performed with the ELK stack, and Nagios provides basic host-level monitoring.
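Prometheus scrapes metrics over HTTP, so node- or application-level metrics can be exposed with a small exporter. The sketch below uses the prometheus_client library (plus psutil for a CPU reading) to publish a single gauge; the metric name and port are illustrative, and such an exporter would still need to be added to the Prometheus scrape configuration.

    # Illustrative custom exporter using the prometheus_client library.
    # Metric name and port are examples only.
    import time
    import psutil  # third-party; used here just to read CPU utilisation
    from prometheus_client import Gauge, start_http_server

    cpu_usage = Gauge("node_cpu_usage_percent", "Current CPU utilisation in percent")

    if __name__ == "__main__":
        start_http_server(9105)          # expose /metrics on port 9105
        while True:
            cpu_usage.set(psutil.cpu_percent(interval=None))
            time.sleep(15)               # roughly match a typical scrape interval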

Future Enhancements

Planned upgrades include:

  • Adding more GPU-accelerated Compute Nodes.
  • Implementing a more advanced storage solution with NVMe-oF.
  • Integrating with a cloud-based object storage service.
  • Exploring the use of serverless computing for certain AI workloads.

Scalability testing will be performed after any hardware changes, and cluster maintenance is scheduled monthly to ensure the ongoing stability and performance of the system. Please refer to the troubleshooting guide for common issues.


Intel-Based Server Configurations

Configuration                | Specifications                                | Benchmark
Core i7-6700K/7700 Server    | 64 GB DDR4, 2 x 512 GB NVMe SSD               | CPU Benchmark: 8046
Core i7-8700 Server          | 64 GB DDR4, 2 x 1 TB NVMe SSD                 | CPU Benchmark: 13124
Core i9-9900K Server         | 128 GB DDR4, 2 x 1 TB NVMe SSD                | CPU Benchmark: 49969
Core i9-13900 Server (64GB)  | 64 GB RAM, 2 x 2 TB NVMe SSD                  |
Core i9-13900 Server (128GB) | 128 GB RAM, 2 x 2 TB NVMe SSD                 |
Core i5-13500 Server (64GB)  | 64 GB RAM, 2 x 500 GB NVMe SSD                |
Core i5-13500 Server (128GB) | 128 GB RAM, 2 x 500 GB NVMe SSD               |
Core i5-13500 Workstation    | 64 GB DDR5 RAM, 2 x NVMe SSD, NVIDIA RTX 4000 |

AMD-Based Server Configurations

Configuration                 | Specifications                 | Benchmark
Ryzen 5 3600 Server           | 64 GB RAM, 2 x 480 GB NVMe     | CPU Benchmark: 17849
Ryzen 7 7700 Server           | 64 GB DDR5 RAM, 2 x 1 TB NVMe  | CPU Benchmark: 35224
Ryzen 9 5950X Server          | 128 GB RAM, 2 x 4 TB NVMe      | CPU Benchmark: 46045
Ryzen 9 7950X Server          | 128 GB DDR5 ECC, 2 x 2 TB NVMe | CPU Benchmark: 63561
EPYC 7502P Server (128GB/1TB) | 128 GB RAM, 1 TB NVMe          | CPU Benchmark: 48021
EPYC 7502P Server (128GB/2TB) | 128 GB RAM, 2 TB NVMe          | CPU Benchmark: 48021
EPYC 7502P Server (128GB/4TB) | 128 GB RAM, 2 x 2 TB NVMe      | CPU Benchmark: 48021
EPYC 7502P Server (256GB/1TB) | 256 GB RAM, 1 TB NVMe          | CPU Benchmark: 48021
EPYC 7502P Server (256GB/4TB) | 256 GB RAM, 2 x 2 TB NVMe      | CPU Benchmark: 48021
EPYC 9454P Server             | 256 GB RAM, 2 x 2 TB NVMe      |

Note: All benchmark scores are approximate and may vary based on configuration. Server availability is subject to stock.