# AI in Coventry: Server Configuration

This article details the server configuration supporting Artificial Intelligence (AI) initiatives within the Coventry infrastructure. It’s aimed at newcomers to the server administration side of our wiki and provides a detailed overview of the hardware and software components involved. Understanding this setup is crucial for anyone contributing to or maintaining these systems.

## Overview

The "AI in Coventry" project utilizes a distributed server cluster to handle the computationally intensive tasks associated with machine learning, deep learning, and natural language processing. The system is designed for scalability and redundancy, leveraging a combination of high-performance computing (HPC) hardware and specialized software frameworks. The primary goal is to provide a robust and efficient platform for research and development in AI-related fields. This includes supporting tasks like model training, inference, and data processing. We utilize a hybrid cloud approach, with core processing occurring on-premise for data security and latency reasons.

## Hardware Infrastructure

The core of the AI infrastructure consists of four primary server nodes, each dedicated to a specific role. These are supplemented by network infrastructure and storage solutions.

### Server Node Specifications

The following table outlines the specifications of each server node:

| Node Name  | CPU                      | RAM             | GPU                    | Storage                    |
|------------|--------------------------|-----------------|------------------------|----------------------------|
| Node-Alpha | 2 x Intel Xeon Gold 6338 | 256 GB DDR4 ECC | 4 x NVIDIA A100 (80GB) | 4 x 4TB NVMe SSD (RAID 0)  |
| Node-Beta  | 2 x Intel Xeon Gold 6338 | 256 GB DDR4 ECC | 4 x NVIDIA A100 (80GB) | 4 x 4TB NVMe SSD (RAID 0)  |
| Node-Gamma | 2 x AMD EPYC 7763        | 512 GB DDR4 ECC | 8 x NVIDIA H100 (80GB) | 8 x 8TB NVMe SSD (RAID 0)  |
| Node-Delta | 2 x AMD EPYC 7763        | 512 GB DDR4 ECC | 8 x NVIDIA H100 (80GB) | 8 x 8TB NVMe SSD (RAID 0)  |

Each node runs a customized version of Ubuntu Server 22.04. Detailed information about Ubuntu Server can be found on its official website.
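
The specifications above can be verified on a live node. The following is a minimal sketch, assuming `psutil` is installed and the NVIDIA driver's `nvidia-smi` is on the PATH (neither is a documented part of this setup), that reports the resources a node actually exposes:

```python
# Minimal hardware sanity check for a single node (sketch only).
import subprocess

import psutil  # assumed to be installed; not part of the documented stack

def report_node_hardware() -> None:
    """Print the CPU, RAM, and GPU resources visible on the local node."""
    print(f"Logical CPUs : {psutil.cpu_count(logical=True)}")
    print(f"Total RAM    : {psutil.virtual_memory().total / 2**30:.0f} GiB")
    # nvidia-smi ships with the NVIDIA driver on each GPU node.
    gpus = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()
    print(f"GPUs         : {len(gpus)}")
    for gpu in gpus:
        print(f"  {gpu}")

if __name__ == "__main__":
    report_node_hardware()
```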

### Network Infrastructure

The servers are interconnected using a 100Gbps InfiniBand network, providing low-latency, high-bandwidth communication crucial for distributed training. A separate 10Gbps Ethernet network is used for management and external access. See Networking Protocols for more information.
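
Distributed training jobs take advantage of this fabric through GPU-aware communication libraries. As a hedged illustration, a PyTorch job would typically initialize the NCCL backend as in the sketch below; the rank and address environment variables are assumed to be supplied by the launcher (e.g. torchrun or the Kubernetes job spec), not hard-coded here:

```python
# Minimal sketch of initializing PyTorch distributed training with NCCL,
# which uses the InfiniBand fabric when the drivers support it.
import os

import torch
import torch.distributed as dist

def init_distributed() -> None:
    # MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are assumed to be set
    # by the job launcher; "env://" tells PyTorch to read them.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")

if __name__ == "__main__":
    init_distributed()
```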

### Storage Solution

A dedicated Network File System (NFS) server, running on a separate high-capacity storage array, provides shared storage for datasets and model checkpoints. The storage array uses a RAID 6 configuration for data redundancy. See NFS Configuration for details on how the NFS server is managed.
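
From a client node's perspective the share is simply a mounted directory. The sketch below lists dataset directories under an assumed mount point; `/mnt/ai-datasets` is a hypothetical path used only for illustration, not a documented value:

```python
# Minimal sketch of inspecting the shared NFS dataset area from a node.
from pathlib import Path

DATASET_ROOT = Path("/mnt/ai-datasets")  # hypothetical NFS mount point

def list_datasets(root: Path = DATASET_ROOT) -> list[str]:
    """Return the names of top-level dataset directories on the share."""
    if not root.is_dir():
        raise FileNotFoundError(f"NFS share not mounted at {root}")
    return sorted(p.name for p in root.iterdir() if p.is_dir())

if __name__ == "__main__":
    print(list_datasets())
```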

## Software Stack

The software stack is built around a core set of open-source tools and frameworks.

### Core Software Components

| Software     | Version | Purpose                                                 |
|--------------|---------|---------------------------------------------------------|
| Python       | 3.9     | Primary programming language                            |
| TensorFlow   | 2.12    | Deep learning framework                                 |
| PyTorch      | 2.0     | Deep learning framework                                 |
| CUDA Toolkit | 12.2    | NVIDIA GPU programming toolkit                          |
| Docker       | 24.0    | Containerization platform; see Docker Basics            |
| Kubernetes   | 1.27    | Container orchestration; see Kubernetes Introduction    |
| MLflow       | 2.6     | Machine learning lifecycle management                   |
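
When setting up an environment, the installed versions can be checked against this table. A minimal sketch, assuming TensorFlow and PyTorch are importable in the active environment:

```python
# Minimal sketch that reports installed framework and CUDA versions so
# they can be compared against the versions listed above.
import sys

import tensorflow as tf
import torch

print(f"Python                 : {sys.version.split()[0]}")
print(f"TensorFlow             : {tf.__version__}")
print(f"PyTorch                : {torch.__version__}")
print(f"CUDA (PyTorch build)   : {torch.version.cuda}")
print(f"GPUs seen by PyTorch   : {torch.cuda.device_count()}")
print(f"GPUs seen by TensorFlow: {len(tf.config.list_physical_devices('GPU'))}")
```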

### Containerization and Orchestration

All AI workloads are deployed within Docker containers and orchestrated using Kubernetes. This ensures portability, reproducibility, and efficient resource utilization. Each server node runs as a dedicated Kubernetes worker.
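
For a quick view of what the cluster advertises to the scheduler, the official Kubernetes Python client can be used. This is a minimal sketch, assuming a valid kubeconfig is available and the NVIDIA device plugin exposes the `nvidia.com/gpu` resource:

```python
# Minimal sketch: list worker nodes and their advertised GPU capacity.
from kubernetes import client, config

def list_gpu_capacity() -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        gpus = node.status.capacity.get("nvidia.com/gpu", "0")
        print(f"{node.metadata.name}: {gpus} GPUs")

if __name__ == "__main__":
    list_gpu_capacity()
```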

### Data Management

Datasets are stored on the centralized NFS server and accessed via shared mount points. Data versioning and tracking are managed using MLflow. The data pipeline utilizes Apache Kafka for real-time data ingestion.
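
A typical training run records its parameters, metrics, and checkpoint artifacts against the MLflow tracking server. The sketch below shows the general pattern; the tracking URI, experiment name, and logged values are placeholders, not documented settings:

```python
# Minimal sketch of logging a training run to MLflow.
import mlflow

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # placeholder URI
mlflow.set_experiment("coventry-ai-demo")                       # placeholder name

with mlflow.start_run():
    mlflow.log_param("dataset_version", "v1.0")  # illustrative values
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_metric("val_accuracy", 0.91)
    # Checkpoints written to the NFS share can be attached as run artifacts:
    # mlflow.log_artifact("/path/to/checkpoint.pt")
```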

## Security Considerations

Security is paramount. Access to the server cluster is restricted to authorized personnel only, utilizing SSH key authentication and multi-factor authentication. Firewall rules are configured to limit network access to essential services. Regular security audits and vulnerability scans are performed. We adhere to the principles outlined in our Security Policy.

## Monitoring and Logging

The system is continuously monitored using Prometheus and Grafana. Logs are collected and analyzed using the ELK stack (Elasticsearch, Logstash, Kibana) to identify and troubleshoot issues. Alerting is configured to notify administrators of critical events. See System Monitoring for details.
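
Training jobs can also expose their own metrics for Prometheus to scrape alongside the node-level exporters. A minimal sketch using the `prometheus_client` library; the port and metric name are illustrative only:

```python
# Minimal sketch: expose a custom training metric for Prometheus to scrape.
import random
import time

from prometheus_client import Gauge, start_http_server

gpu_util = Gauge("training_gpu_utilization",
                 "GPU utilization reported by the training job")

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<node>:8000/metrics
    while True:
        gpu_util.set(random.uniform(0, 100))  # placeholder value for the sketch
        time.sleep(15)
```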

## Future Enhancements

Planned future enhancements include:
