AI in Coventry: Server Configuration
This article details the server configuration supporting Artificial Intelligence (AI) initiatives within the Coventry infrastructure. It is aimed at newcomers to server administration on our wiki and provides an overview of the hardware and software components involved. Understanding this setup is important for anyone contributing to or maintaining these systems.
Overview
The "AI in Coventry" project utilizes a distributed server cluster to handle the computationally intensive tasks associated with machine learning, deep learning, and natural language processing. The system is designed for scalability and redundancy, leveraging a combination of high-performance computing (HPC) hardware and specialized software frameworks. The primary goal is to provide a robust and efficient platform for research and development in AI-related fields. This includes supporting tasks like model training, inference, and data processing. We utilize a hybrid cloud approach, with core processing occurring on-premise for data security and latency reasons.
Hardware Infrastructure
The core of the AI infrastructure consists of four primary server nodes, each dedicated to a specific role. These are supplemented by network infrastructure and storage solutions.
Server Node Specifications
The following table outlines the specifications of each server node:
Node Name | CPU | RAM | GPU | Storage |
---|---|---|---|---|
Node-Alpha | 2 x Intel Xeon Gold 6338 | 256 GB DDR4 ECC | 4 x NVIDIA A100 (80GB) | 4 x 4TB NVMe SSD (RAID 0) |
Node-Beta | 2 x Intel Xeon Gold 6338 | 256 GB DDR4 ECC | 4 x NVIDIA A100 (80GB) | 4 x 4TB NVMe SSD (RAID 0) |
Node-Gamma | 2 x AMD EPYC 7763 | 512 GB DDR4 ECC | 8 x NVIDIA H100 (80GB) | 8 x 8TB NVMe SSD (RAID 0) |
Node-Delta | 2 x AMD EPYC 7763 | 512 GB DDR4 ECC | 8 x NVIDIA H100 (80GB) | 8 x 8TB NVMe SSD (RAID 0) |
Each node runs a customized version of Ubuntu Server 22.04. Detailed information about Ubuntu Server can be found on its official website.
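As a quick sanity check on capacity planning, the figures in the node table above can be totalled programmatically. The numbers below are transcribed from that table; re-check them whenever nodes are added or upgraded.

```python
# Cluster totals derived from the node specification table above.
# Figures are transcribed from the wiki table, not queried live.
nodes = {
    "Node-Alpha": {"ram_gb": 256, "gpus": 4, "gpu_mem_gb": 80},
    "Node-Beta":  {"ram_gb": 256, "gpus": 4, "gpu_mem_gb": 80},
    "Node-Gamma": {"ram_gb": 512, "gpus": 8, "gpu_mem_gb": 80},
    "Node-Delta": {"ram_gb": 512, "gpus": 8, "gpu_mem_gb": 80},
}

total_gpus = sum(n["gpus"] for n in nodes.values())
total_gpu_mem = sum(n["gpus"] * n["gpu_mem_gb"] for n in nodes.values())
total_ram = sum(n["ram_gb"] for n in nodes.values())

print(total_gpus)      # 24 GPUs cluster-wide
print(total_gpu_mem)   # 1920 GB of aggregate GPU memory
print(total_ram)       # 1536 GB of aggregate system RAM
```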
Network Infrastructure
The servers are interconnected using a 100Gbps InfiniBand network, providing low-latency, high-bandwidth communication crucial for distributed training. A separate 10Gbps Ethernet network is used for management and external access. See Networking Protocols for more information.
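The practical difference between the two networks is easy to quantify with a back-of-envelope calculation. The sketch below assumes an ideal line rate with no protocol overhead or congestion, and the 150 GB checkpoint size is purely illustrative.

```python
def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Ideal transfer time for `size_gb` gigabytes over a link rated
    in gigabits per second. Ignores protocol overhead and congestion."""
    return (size_gb * 8) / link_gbps

checkpoint_gb = 150  # illustrative large-model checkpoint size

print(transfer_seconds(checkpoint_gb, 100))  # 12.0 s over 100Gbps InfiniBand
print(transfer_seconds(checkpoint_gb, 10))   # 120.0 s over 10Gbps Ethernet
```

This order-of-magnitude gap is why gradient exchange and checkpoint traffic stay on the InfiniBand fabric, while the Ethernet network is reserved for management and external access.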
Storage Solution
A dedicated Network File System (NFS) server, running on a separate high-capacity storage array, provides shared storage for datasets and model checkpoints. The storage array uses a RAID 6 configuration for data redundancy. The NFS server is managed via NFS Configuration.
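RAID 6 dedicates two disks' worth of capacity to parity, so the array survives any two simultaneous disk failures at the cost of two disks of usable space. The sketch below illustrates the arithmetic; the 12 x 16 TB layout is a hypothetical example, since the article does not list the array's exact disk count.

```python
def raid6_usable_tb(disk_count: int, disk_tb: float) -> float:
    """Usable capacity of a RAID 6 array: two disks' worth of space
    is consumed by parity, tolerating any two disk failures."""
    if disk_count < 4:
        raise ValueError("RAID 6 requires at least 4 disks")
    return (disk_count - 2) * disk_tb

# Hypothetical 12 x 16 TB array (actual layout not documented here):
print(raid6_usable_tb(12, 16))  # 160 TB usable
```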
Software Stack
The software stack is built around a core set of open-source tools and frameworks.
Core Software Components
Software | Version | Purpose |
---|---|---|
Python | 3.9 | Primary programming language |
TensorFlow | 2.12 | Deep learning framework |
PyTorch | 2.0 | Deep learning framework |
CUDA Toolkit | 12.2 | NVIDIA GPU programming toolkit |
Docker | 24.0 | Containerization platform, see Docker Basics |
Kubernetes | 1.27 | Container orchestration, see Kubernetes Introduction |
MLflow | 2.6 | Machine Learning Lifecycle Management |
Containerization and Orchestration
All AI workloads are deployed within Docker containers and orchestrated using Kubernetes. This ensures portability, reproducibility, and efficient resource utilization. Each server node runs as a dedicated Kubernetes worker.
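To give newcomers a feel for how a GPU workload reaches the cluster, here is a minimal sketch of a pod specification requesting GPUs. The `nvidia.com/gpu` resource name is the one exposed by the standard NVIDIA device plugin; the image name and namespace are placeholders, not values from our cluster.

```python
import json

# Minimal sketch of a GPU training pod spec. Image and namespace
# are placeholders; only the overall shape matters here.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job", "namespace": "ai-workloads"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "trainer",
            "image": "registry.example/train:latest",  # placeholder image
            "resources": {
                # Resource name exposed by the NVIDIA device plugin.
                "limits": {"nvidia.com/gpu": 2},
            },
        }],
    },
}

print(json.dumps(pod, indent=2))
```

In practice such specs are kept as version-controlled manifests and applied with standard Kubernetes tooling rather than built inline.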
Data Management
Datasets are stored on the centralized NFS server and accessed via shared mount points. Data versioning and tracking are managed using MLflow. The data pipeline utilizes Apache Kafka for real-time data ingestion.
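MLflow records the actual lineage metadata, but the core idea behind data versioning is simple enough to sketch: derive a version identifier from the dataset's contents, so the identifier changes if and only if the data changes. This toy sketch works on in-memory file contents, not the real NFS paths.

```python
import hashlib

def dataset_version(files: "dict[str, bytes]") -> str:
    """Derive a version id from dataset contents by hashing files in a
    stable (sorted) order. MLflow tracks richer lineage metadata;
    this merely illustrates the content-addressing idea."""
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(name.encode())
        h.update(files[name])
    return h.hexdigest()

v1 = dataset_version({"train.csv": b"1,2,3", "labels.csv": b"0,1"})
v2 = dataset_version({"train.csv": b"1,2,4", "labels.csv": b"0,1"})
print(v1 != v2)  # True: any byte change yields a new version id
```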
Security Considerations
Security is paramount. Access to the server cluster is restricted to authorized personnel only, utilizing SSH key authentication and multi-factor authentication. Firewall rules are configured to limit network access to essential services. Regular security audits and vulnerability scans are performed. We adhere to the principles outlined in our Security Policy.
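The "limit network access to essential services" principle amounts to a default-deny allowlist. The sketch below renders iptables-style rule strings for a handful of services; the ports are common defaults and the subnet is hypothetical, not our live ruleset.

```python
# Sketch of the default-deny allowlist idea. Ports are typical
# defaults and the subnet is a placeholder, not the live config.
ESSENTIAL_SERVICES = {
    "ssh": 22,              # SSH key + MFA authenticated access
    "nfs": 2049,            # shared dataset storage
    "kubernetes-api": 6443, # cluster control plane
}

def allowlist_rules(services: "dict[str, int]", subnet: str) -> "list[str]":
    """Render one accept rule per essential service, then drop the rest."""
    rules = [f"-A INPUT -s {subnet} -p tcp --dport {port} -j ACCEPT"
             for _, port in sorted(services.items(), key=lambda kv: kv[1])]
    rules.append("-A INPUT -j DROP")  # default deny everything else
    return rules

for rule in allowlist_rules(ESSENTIAL_SERVICES, "10.0.0.0/24"):
    print(rule)
```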
Monitoring and Logging
The system is continuously monitored using Prometheus and Grafana. Logs are collected and analyzed using the ELK stack (Elasticsearch, Logstash, Kibana) to identify and troubleshoot issues. Alerting is configured to notify administrators of critical events. See System Monitoring for details.
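Alerting rules generally fire only when a metric stays bad for a sustained window, which filters out transient spikes. This toy version of that condition uses illustrative GPU temperature values; the real thresholds live in the Prometheus rule files, not here.

```python
# Toy version of a sustained-threshold alert condition. The
# threshold and sample values are illustrative, not production ones.
def should_alert(samples: "list[float]", threshold: float, window: int) -> bool:
    """Fire only if the last `window` samples all exceed `threshold`,
    so a single brief spike does not page anyone."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])

gpu_temp_c = [71, 74, 86, 88, 87, 89]
print(should_alert(gpu_temp_c, threshold=85, window=3))  # True
print(should_alert(gpu_temp_c, threshold=85, window=5))  # False (74 is in the window)
```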
Future Enhancements
Planned future enhancements include:
- Integration with a cloud-based object storage solution for long-term data archiving.
- Implementation of automated scaling capabilities for Kubernetes.
- Exploration of federated learning techniques to enable collaborative model training across multiple institutions.
- Upgrading to the latest generation of GPUs as they become available. See GPU Technology.
See Server Maintenance for the routine procedures required to keep the cluster operational.