# AI in Kent: Server Configuration

This document details the server configuration for the "AI in Kent" project, providing a technical overview for system administrators, developers, and anyone contributing to the platform. It covers hardware specifications, software stack, networking, and security considerations. This guide assumes a basic understanding of Linux server administration and MediaWiki concepts.

## Overview

The "AI in Kent" project leverages a cluster of servers to provide the computational resources needed for training and deploying artificial intelligence models, specifically focusing on natural language processing and computer vision. The servers are located in a dedicated data center within Kent, and are connected via a high-speed network. This documentation outlines the key components and configurations. See Special:MyUserPage for contact information for the team maintaining this infrastructure.

## Hardware Specifications

The server cluster comprises three primary types of nodes: Master Nodes, Worker Nodes, and Storage Nodes. Each node type has specific hardware requirements to optimize performance and reliability.

| Node Type | CPU | RAM | Storage | Network Interface |
|-----------|-----|-----|---------|-------------------|
| Master Nodes (2) | 2 x Intel Xeon Gold 6248R (24 cores each) | 256 GB DDR4 ECC REG | 2 x 1TB NVMe SSD (RAID 1) | 100 Gbps Ethernet |
| Worker Nodes (8) | 2 x AMD EPYC 7763 (64 cores each) | 512 GB DDR4 ECC REG | 4 x 4TB SAS HDD (RAID 10) + 1 x 1TB NVMe SSD (local scratch) | 100 Gbps Ethernet |
| Storage Nodes (3) | 2 x Intel Xeon Silver 4210 (10 cores each) | 128 GB DDR4 ECC REG | 16 x 16TB SAS HDD (RAID 6) | 40 Gbps Ethernet |

These specifications were chosen to balance cost-effectiveness with the demanding requirements of AI workloads. The use of NVMe SSDs for the Master Nodes ensures fast boot times and responsiveness, while the large RAM capacity on the Worker Nodes allows for handling large datasets. Server Room Access is restricted to authorized personnel only.
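To give a rough sense of the usable capacity these RAID layouts yield, the following sketch computes it for each node type. It uses the drive counts and sizes from the table above, assumes decimal terabytes, and ignores filesystem and metadata overhead:

```python
def usable_tb(drives: int, size_tb: float, raid: str) -> float:
    """Approximate usable capacity for common RAID levels (overhead ignored)."""
    if raid == "RAID 1":    # full mirror: capacity of a single drive
        return size_tb
    if raid == "RAID 10":   # striped mirrors: half the drives hold data
        return drives // 2 * size_tb
    if raid == "RAID 6":    # dual parity: two drives' worth of capacity lost
        return (drives - 2) * size_tb
    raise ValueError(f"unsupported RAID level: {raid}")

print(usable_tb(2, 1, "RAID 1"))    # Master Nodes: 2 x 1TB mirrored -> 1 TB
print(usable_tb(4, 4, "RAID 10"))   # Worker Nodes: 4 x 4TB -> 8 TB
print(usable_tb(16, 16, "RAID 6"))  # Storage Nodes: 16 x 16TB -> 224 TB
```

Real-world usable space will be somewhat lower once filesystem formatting and Ceph replication (for the Storage Nodes) are taken into account.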

## Software Stack

The software stack is built on a foundation of Ubuntu Server 22.04 LTS.

| Component | Version | Purpose |
|-----------|---------|---------|
| Operating System | Ubuntu Server 22.04 LTS | Provides the base operating system environment. |
| Kubernetes | v1.27 | Container orchestration platform for managing and scaling AI workloads. See Kubernetes Documentation. |
| Docker | 20.10.21 | Containerization technology for packaging and deploying AI models. |
| NVIDIA CUDA Toolkit | 11.8 | Provides the tools and libraries for GPU-accelerated computing. GPU Driver Updates are critical. |
| TensorFlow | 2.12 | Machine learning framework. |
| PyTorch | 2.0 | Machine learning framework. |
| Ceph | Pacific | Distributed storage system for managing large datasets. |

All software packages are managed using `apt` and regularly updated to ensure security and stability. We utilize Ansible for automated configuration management.
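A minimal sketch of what such an Ansible task could look like, using the `ansible.builtin.apt` module; the playbook filename and the `cluster` host group are illustrative, not the project's actual inventory:

```yaml
# update-packages.yml -- illustrative playbook; host group name is assumed
- hosts: cluster
  become: true
  tasks:
    - name: Update the apt cache and upgrade all packages
      ansible.builtin.apt:
        update_cache: yes
        upgrade: dist
```

Running a playbook like this on a schedule (or from CI) keeps all nodes on a consistent, patched package set.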

## Networking Configuration

The server cluster is segmented into dedicated VLANs, each using a /24 subnet. Each node is assigned a static IP address. The network topology is a spine-leaf architecture, providing high bandwidth and low latency.

| Network Segment | Subnet | Gateway | DNS Server |
|-----------------|--------|---------|------------|
| Management Network | 192.168.1.0/24 | 192.168.1.1 | 8.8.8.8, 8.8.4.4 |
| Cluster Network | 10.0.0.0/24 | 10.0.0.1 | 10.0.0.1 |
| Storage Network | 172.16.0.0/24 | 172.16.0.1 | 172.16.0.1 |
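Each /24 subnet leaves 254 usable host addresses, which bounds cluster growth per segment. Python's standard `ipaddress` module can verify this and check whether an address belongs to a segment (the addresses below come from the table above):

```python
import ipaddress

# Cluster network from the table above
cluster = ipaddress.ip_network("10.0.0.0/24")

# 256 total addresses minus network and broadcast addresses
usable_hosts = cluster.num_addresses - 2
print(usable_hosts)                                    # 254

# Confirm the gateway sits inside the subnet
print(ipaddress.ip_address("10.0.0.1") in cluster)     # True
```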

Firewall rules are configured using `ufw` to restrict access to only necessary ports. See Network Diagram for detailed visualization. The use of VPN Access is required for remote administration.
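A hedged sketch of `ufw` rules in this spirit; the specific ports are illustrative assumptions (SSH from the management subnet only, and 6443 for the Kubernetes API server), not the cluster's actual policy:

```shell
# Illustrative ufw configuration -- not the cluster's actual rule set
ufw default deny incoming
ufw default allow outgoing
# SSH only from the management network
ufw allow from 192.168.1.0/24 to any port 22 proto tcp
# Kubernetes API server traffic from the cluster network
ufw allow from 10.0.0.0/24 to any port 6443 proto tcp
ufw enable
```

Defaulting to deny-incoming and then allowing traffic per-subnet keeps the rule set small and auditable.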

## Security Considerations

Security is a paramount concern. The following measures are in place:
