AI in Basingstoke: Server Configuration
This article details the server configuration supporting Artificial Intelligence (AI) initiatives within the Basingstoke data centre. It's aimed at new engineers joining the team and provides a comprehensive overview of the hardware and software setup. Understanding this configuration is crucial for maintaining system stability and facilitating future expansion. This document assumes familiarity with basic Linux server administration and networking concepts.
Overview
The Basingstoke AI infrastructure is designed for high-throughput processing of large datasets, used primarily for machine learning model training and natural language processing. We utilize a cluster of high-performance servers, interconnected via a low-latency network. The core operating system is Ubuntu Server 22.04 LTS, chosen for its stability and extensive package repository. All data is stored on a Network File System (NFS) share, providing centralized access and simplifying data management. The system relies heavily on Docker for containerization and Kubernetes for orchestration.
Hardware Specifications
The primary compute nodes are based on a standardized configuration, detailed below. There are currently 24 nodes in the cluster, with plans for expansion in Q4 2024. A dedicated monitoring server collects performance metrics.
| Component | Specification |
|---|---|
| CPU | AMD EPYC 7763 (64 cores, 128 threads) |
| RAM | 512 GB DDR4 ECC Registered (3200 MHz) |
| Storage (OS) | 500 GB NVMe SSD |
| Storage (Data) | Access via 100 GbE to the central NFS server |
| Network Interface | Dual 100 GbE Mellanox ConnectX-6 |
| GPU | 4 × NVIDIA A100 (80 GB) |
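With 24 identical nodes, the aggregate capacity of the cluster follows directly from the per-node figures above. A quick sketch (node count and per-node figures taken from the table; the variable names are illustrative):

```python
# Aggregate capacity of the compute cluster, using the per-node
# figures from the table above (24 nodes as of this writing).
NODES = 24
CORES_PER_NODE = 64          # AMD EPYC 7763
RAM_GB_PER_NODE = 512
GPUS_PER_NODE = 4            # NVIDIA A100
GPU_MEM_GB = 80              # per GPU

total_cores = NODES * CORES_PER_NODE          # 1536 physical cores
total_ram_gb = NODES * RAM_GB_PER_NODE        # 12288 GB of system RAM
total_gpus = NODES * GPUS_PER_NODE            # 96 GPUs
total_gpu_mem_gb = total_gpus * GPU_MEM_GB    # 7680 GB of GPU memory

print(f"{total_cores} cores, {total_ram_gb} GB RAM, "
      f"{total_gpus} GPUs ({total_gpu_mem_gb} GB GPU memory)")
```

These totals are worth keeping in mind when sizing training jobs: a job that assumes more than 80 GB of GPU memory per worker will not fit a single A100 on any node.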
The NFS server itself is a separate, highly-available system.
| Component | Specification |
|---|---|
| CPU | Dual Intel Xeon Platinum 8380 (40 cores each) |
| RAM | 1 TB DDR4 ECC Registered (3200 MHz) |
| Storage | 2 × 4 TB NVMe SSD (RAID 1, OS); 12 × 16 TB SAS HDD (RAID 6, data) |
| Network Interface | Dual 100 GbE Mellanox ConnectX-6 |
Finally, the Kubernetes master node requires specific resources:
| Component | Specification |
|---|---|
| CPU | Intel Xeon Gold 6338 (32 cores) |
| RAM | 256 GB DDR4 ECC Registered (3200 MHz) |
| Storage | 1 TB NVMe SSD |
| Network Interface | Dual 10 GbE Intel X710 |
Software Stack
The software stack is built around a containerized environment. Python 3.9 is the primary programming language for AI development. Libraries such as TensorFlow, PyTorch, and scikit-learn are pre-installed in the base Docker images.
- Operating System: Ubuntu Server 22.04 LTS
- Containerization: Docker 20.10.7
- Orchestration: Kubernetes 1.23.4
- Programming Language: Python 3.9
- AI Frameworks: TensorFlow 2.8.0, PyTorch 1.11.0, scikit-learn 1.0.2
- Data Storage: NFS v4.1
- Monitoring: Prometheus & Grafana
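Because the framework versions above are pinned, a routine sanity check is comparing the versions reported by a base image against the pins. A minimal sketch, using only the standard library (the `REQUIRED` mapping and helper functions are illustrative, not part of any existing Basingstoke tooling):

```python
# Pins from the software-stack list above.
REQUIRED = {
    "tensorflow": "2.8.0",
    "torch": "1.11.0",
    "scikit-learn": "1.0.2",
}

def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '2.8.0' into (2, 8, 0)."""
    return tuple(int(part) for part in v.split("."))

def matches_pin(installed: str, pinned: str) -> bool:
    """True when the installed version exactly matches the pin."""
    return parse_version(installed) == parse_version(pinned)

# Example: versions reported by a candidate base image (hypothetical values).
reported = {"tensorflow": "2.8.0", "torch": "1.11.0", "scikit-learn": "1.0.2"}
mismatches = {name: ver for name, ver in reported.items()
              if not matches_pin(ver, REQUIRED[name])}
print(mismatches)  # an empty dict means every pin is satisfied
```

In practice the `reported` values would come from `importlib.metadata.version()` inside the running container rather than a hard-coded dict.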
Network Configuration
The network is segmented into three zones: Management, Compute, and Storage. The Management network is used for accessing the servers via SSH and for system administration. The Compute network is a high-bandwidth, low-latency network used for inter-node communication within the AI cluster. The Storage network connects the compute nodes to the NFS server. All networks are firewalled using iptables. Internal DNS is provided by BIND9.
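The three-zone split can be expressed as simple subnet lookups. The sketch below uses the standard-library `ipaddress` module; the CIDR ranges are illustrative placeholders, not the real Basingstoke allocations:

```python
import ipaddress

# Placeholder zone-to-subnet mapping; these CIDR ranges are
# assumptions for illustration, not the actual allocations.
ZONES = {
    "management": ipaddress.ip_network("10.0.0.0/24"),
    "compute":    ipaddress.ip_network("10.1.0.0/16"),
    "storage":    ipaddress.ip_network("10.2.0.0/16"),
}

def zone_for(addr: str) -> str:
    """Return the name of the zone whose subnet contains addr, or 'unknown'."""
    ip = ipaddress.ip_address(addr)
    for name, net in ZONES.items():
        if ip in net:
            return name
    return "unknown"

print(zone_for("10.1.42.7"))    # compute
print(zone_for("192.168.1.1"))  # unknown
```

A helper like this is handy when writing firewall or monitoring rules, since it makes the zone boundaries explicit in code rather than scattered across iptables rules.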
Security Considerations
Security is paramount. All servers are behind a hardware firewall. Access to the servers is restricted to authorized personnel via SSH key authentication. Regular security audits are conducted. The NFS share is configured with appropriate permissions to prevent unauthorized access. We employ intrusion detection systems (IDS) to monitor for malicious activity. All data in transit is encrypted using TLS.
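One routine audit check is confirming that sshd only permits key-based logins. A minimal sketch that parses `sshd_config` text (the helper below is hypothetical, not an existing audit tool; a real audit should also inspect `sshd -T` effective settings and any Include directives):

```python
def password_auth_disabled(sshd_config: str) -> bool:
    """Return True only if the config explicitly sets PasswordAuthentication no.
    Like sshd itself, this honours the first occurrence of the keyword."""
    for line in sshd_config.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        parts = stripped.split(None, 1)
        if len(parts) == 2 and parts[0] == "PasswordAuthentication":
            return parts[1].strip().lower() == "no"
    # Not set explicitly: fail closed, since OpenSSH defaults this to 'yes'.
    return False

sample = """
# Managed by configuration management
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
"""
print(password_auth_disabled(sample))  # True
```

Wiring a check like this into the regular security audits catches configuration drift before it becomes an incident.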
Future Expansion
We are planning to upgrade the GPUs to NVIDIA H100s in Q1 2025. This will significantly increase the processing power of the cluster. We are also investigating the use of RDMA over Converged Ethernet (RoCE) to further reduce network latency. Additional storage capacity will be added to the NFS server as needed. The team is also exploring the integration of automatic scaling within the Kubernetes cluster.