AI in Gillingham: Server Configuration Documentation
Welcome to the documentation for the "AI in Gillingham" server deployment. This article provides an overview of the hardware and software configuration of this dedicated AI processing cluster. It is intended for new administrators and engineers onboarding to the project. Read this document carefully before making any changes to the system, and refer to our System Administration Guidelines for general operational procedures.
Overview
The "AI in Gillingham" project utilizes a distributed server cluster to perform computationally intensive machine learning tasks, specifically image recognition and natural language processing. The cluster is located in the Gillingham data center and is designed for scalability and high availability. Consult the Data Center Access Procedures before any physical access. The entire system is monitored via the Nagios Monitoring System, and the Disaster Recovery Plan details procedures in case of failure.
Hardware Configuration
The cluster consists of a master node and four worker nodes. All nodes are interconnected via a 100Gbps InfiniBand network. Detailed specifications for each node type are provided below.
Master Node
The master node manages task scheduling, data distribution, and overall cluster health. It does *not* participate in the actual AI processing.
| Specification | Value |
|---|---|
| CPU | Dual Intel Xeon Gold 6338 |
| RAM | 256 GB DDR4 ECC Registered |
| Storage (OS) | 1 TB NVMe SSD |
| Storage (Metadata) | 8 TB SAS HDD in RAID 1 |
| Network Interface | Dual 100Gbps InfiniBand, Dual 10Gbps Ethernet |
| Power Supply | Redundant 1600W Platinum |
Worker Nodes
The worker nodes perform the actual AI processing. They are equipped with powerful GPUs to accelerate computations.
| Specification | Value |
|---|---|
| CPU | Dual Intel Xeon Silver 4310 |
| RAM | 128 GB DDR4 ECC Registered |
| Storage (OS) | 512 GB NVMe SSD |
| GPU | 4x NVIDIA A100 80GB |
| Network Interface | Dual 100Gbps InfiniBand |
| Power Supply | Redundant 1600W Platinum |
Network Infrastructure
The network is a critical component of the cluster. All nodes reside on a dedicated VLAN.
| Component | Specification |
|---|---|
| Switch | Mellanox Spectrum-2 |
| Network Topology | Fat Tree |
| VLAN Subnet | 192.168.10.0/24 |
| Subnet Mask | 255.255.255.0 |
| Gateway | 192.168.10.1 |
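The addressing above can be sanity-checked with Python's standard `ipaddress` module. A minimal sketch (note that a 255.255.255.0 mask corresponds to a /24 prefix, and the node address used below is a hypothetical example on the cluster VLAN):

```python
import ipaddress

# The cluster VLAN: mask 255.255.255.0 corresponds to a /24 prefix.
vlan = ipaddress.ip_network("192.168.10.0/24")
gateway = ipaddress.ip_address("192.168.10.1")

# The gateway must sit inside the VLAN subnet.
assert gateway in vlan

# Hypothetical worker-node address, checked for subnet membership.
node_ip = ipaddress.ip_address("192.168.10.20")
print(node_ip in vlan)         # True
print(vlan.netmask)            # 255.255.255.0
print(vlan.num_addresses - 2)  # 254 usable host addresses
```

Running a check like this before assigning static addresses catches prefix/mask mismatches early.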
Software Configuration
The cluster runs Ubuntu Server 22.04 LTS with a customized kernel for optimized GPU performance. The primary AI framework used is TensorFlow 2.12. Detailed instructions for installing and configuring TensorFlow can be found in the TensorFlow Installation Guide. We leverage Kubernetes for container orchestration and Docker for containerization.
Operating System
- Operating System: Ubuntu Server 22.04 LTS
- Kernel Version: 5.15.0-76-generic
- Filesystem: ext4
AI Framework
- Framework: TensorFlow 2.12
- CUDA Version: 11.8
- cuDNN Version: 8.6.0
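The framework versions above must stay in lockstep; a mismatched CUDA or cuDNN build is a common cause of TensorFlow silently falling back to CPU execution. A minimal sketch of a version pin check (the table below encodes only the combination documented here, not TensorFlow's full compatibility matrix):

```python
# Known-good framework stack for this cluster, as documented above.
PINNED_STACK = {"tensorflow": "2.12", "cuda": "11.8", "cudnn": "8.6.0"}

def stack_matches(installed: dict) -> bool:
    """Compare an installed-version report against the pinned stack."""
    return all(installed.get(key) == value for key, value in PINNED_STACK.items())

# Example reports (version strings are illustrative inputs):
print(stack_matches({"tensorflow": "2.12", "cuda": "11.8", "cudnn": "8.6.0"}))  # True
print(stack_matches({"tensorflow": "2.12", "cuda": "12.0", "cudnn": "8.6.0"}))  # False
```

A check like this can run as a node readiness probe before workloads are scheduled.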
Cluster Management
- Cluster Manager: Kubernetes 1.27
- Container Runtime: Docker 20.10
- Monitoring: Prometheus and Grafana (see Monitoring Dashboard Access)
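Workloads reach the worker GPUs through Kubernetes resource requests. A hedged sketch of what a pod manifest for this cluster might look like, built as a plain Python dict (the pod name and image tag are hypothetical examples; `nvidia.com/gpu` is the resource name exposed by NVIDIA's Kubernetes device plugin):

```python
import json

# Hypothetical training pod requesting all four A100s on one worker node.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "nlp-training"},  # hypothetical pod name
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "tensorflow/tensorflow:2.12.0-gpu",  # illustrative image tag
            "resources": {
                "limits": {"nvidia.com/gpu": 4},  # one full worker node's GPUs
            },
        }],
        "restartPolicy": "Never",
    },
}

print(json.dumps(pod, indent=2))
```

GPUs are requested under `limits` because Kubernetes does not allow GPU over-subscription; the scheduler binds the pod to a worker with four free GPUs.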
Security Considerations
Security is paramount. All nodes are behind a firewall and access is restricted to authorized personnel only. Regular security audits are conducted. Refer to the Security Policy Document for detailed information. All data is encrypted at rest and in transit. We follow the principles defined in the Data Security Best Practices.
Troubleshooting
Common issues and their resolutions are documented in the Troubleshooting Guide. If you encounter a problem not covered in the guide, please submit a ticket to the Help Desk. Remember to check the System Logs for error messages.
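When scanning the System Logs, a small filter script often surfaces problems faster than paging through raw files. An illustrative sketch (the sample log lines and the keyword heuristic are assumptions; adapt the pattern and log source to the actual syslog layout):

```python
import re

# Hypothetical sample of syslog-style lines; in practice, read from the system logs.
sample_log = """\
Jun 01 10:00:01 worker-1 kernel: NVRM: GPU at 0000:3b:00.0 has fallen off the bus.
Jun 01 10:00:02 worker-1 dockerd: level=info msg="container started"
Jun 01 10:00:03 worker-1 kubelet: E0601 10:00:03 pod sync error
"""

def find_errors(text: str) -> list[str]:
    """Return lines that look like errors (simple keyword heuristic)."""
    pattern = re.compile(r"error|fail|fallen off", re.IGNORECASE)
    return [line for line in text.splitlines() if pattern.search(line)]

for line in find_errors(sample_log):
    print(line)
```

Attach any matching lines to your Help Desk ticket; they speed up triage considerably.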