AI in Margate: Server Configuration Documentation
Welcome to the documentation for the "AI in Margate" server configuration. This article details the hardware and software setup for our artificial intelligence research and development environment located in Margate. This guide is intended for newcomers to the wiki and those responsible for maintaining the system. Understanding these configurations is crucial for troubleshooting, upgrades, and ensuring optimal performance. Please read this document carefully before making any changes to the server environment.
Overview
The “AI in Margate” project requires significant computational resources. The server infrastructure is designed for high-throughput processing of large datasets, model training, and real-time inference. The system is built around a cluster of dedicated servers, interconnected via a high-speed network. This document will cover the key components, including hardware specifications, software stack, and network configuration. We will also cover basic system administration procedures. Refer to the System Administration Guide for more in-depth information on general server maintenance procedures. Understanding Network Topology is also vital.
Hardware Specifications
The core of our AI infrastructure consists of three primary server types: Compute Nodes, Storage Nodes, and a Management Node. Each node type has specific hardware requirements to optimise its function.
Compute Nodes (AI-CN01, AI-CN02, AI-CN03, AI-CN04): These nodes are responsible for the heavy lifting of model training and inference.
Component | Specification |
---|---|
CPU | 2 x Intel Xeon Gold 6338 (32 cores/64 threads per CPU) |
RAM | 512 GB DDR4 ECC Registered @ 3200MHz |
GPU | 4 x NVIDIA A100 80GB PCIe 4.0 |
Storage (Local) | 2 x 1.92TB NVMe PCIe 4.0 SSD (RAID 0) |
Network Interface | 2 x 100GbE Mellanox ConnectX-6 |
Power Supply | 2 x 2000W Redundant Power Supplies |
Storage Nodes (AI-SN01, AI-SN02): These nodes provide the persistent storage for datasets and model checkpoints.
Component | Specification |
---|---|
CPU | 2 x Intel Xeon Silver 4310 (12 cores/24 threads per CPU) |
RAM | 256 GB DDR4 ECC Registered @ 3200MHz |
Storage (RAID) | 16 x 16TB SAS 7.2K RPM HDD (RAID 6 - approximately 224TB usable across 14 data drives) |
Network Interface | 2 x 40GbE Mellanox ConnectX-5 |
Power Supply | 2 x 1600W Redundant Power Supplies |
Management Node (AI-MN01): This node handles system monitoring, user authentication, and job scheduling.
Component | Specification |
---|---|
CPU | 2 x Intel Xeon E-2336 (8 cores/16 threads per CPU) |
RAM | 64 GB DDR4 ECC Registered @ 3200MHz |
Storage | 2 x 480GB SATA SSD (RAID 1) |
Network Interface | 2 x 1GbE Intel Ethernet |
Power Supply | 1 x 850W Power Supply |
Refer to the Hardware Inventory for a complete list of all serial numbers and asset tags.
Software Stack
The operating system across all nodes is Ubuntu Server 22.04 LTS. The software stack is carefully selected to support our AI workflows. See the Software List for licensing details.
- Operating System: Ubuntu Server 22.04 LTS
- Containerization: Docker and Kubernetes are used for application deployment and management. See the Docker Configuration Guide for details.
- Programming Languages: Python 3.9 is the primary language, with support for R and Julia.
- AI Frameworks: TensorFlow, PyTorch, and scikit-learn are the core AI frameworks (a node-level GPU sanity check is sketched after this list).
- Data Storage: Ceph is used for distributed storage across the Storage Nodes, providing scalability and redundancy. The Ceph Cluster Configuration is critical to understand.
- Job Scheduling: Slurm Workload Manager is used to manage and schedule jobs across the Compute Nodes. See Slurm Usage Guide for more information.
- Monitoring: Prometheus and Grafana are used for system monitoring and visualization. Refer to the Monitoring Dashboard Guide.
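As a quick post-provisioning check on a Compute Node, a short Python script along the following lines can confirm that all four A100 GPUs are visible to the frameworks. This is a minimal sketch: it assumes the NVIDIA driver and a CUDA-enabled PyTorch build are already installed on the node.

```python
# Sanity check for a Compute Node: confirm PyTorch can see the expected GPUs.
# Assumes the NVIDIA driver and a CUDA-enabled PyTorch build are installed.
import torch

def check_gpus(expected: int = 4) -> None:
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available on this node")
    found = torch.cuda.device_count()
    for i in range(found):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
    if found != expected:
        raise RuntimeError(f"Expected {expected} GPUs, found {found}")

if __name__ == "__main__":
    check_gpus()
```

The same script can be submitted through Slurm to validate each Compute Node in turn.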
Network Configuration
The servers are interconnected on dedicated networks: 100GbE between the Compute Nodes, 40GbE to the Storage Nodes, and 1GbE for the Management Node (see the hardware tables above). The network is segmented into three subnets (a small lookup helper is sketched after the list):
- Compute Network: 192.168.1.0/24 (for communication between Compute Nodes)
- Storage Network: 192.168.2.0/24 (for communication between Storage Nodes and Compute Nodes)
- Management Network: 192.168.3.0/24 (for management access and monitoring)
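When troubleshooting connectivity, it helps to know which segment an address belongs to. The following sketch uses only the Python standard library and mirrors the subnet list above; the segment names are for illustration.

```python
# Map an IP address to the network segment defined above (standard library only).
import ipaddress

SUBNETS = {
    "Compute Network": ipaddress.ip_network("192.168.1.0/24"),
    "Storage Network": ipaddress.ip_network("192.168.2.0/24"),
    "Management Network": ipaddress.ip_network("192.168.3.0/24"),
}

def segment_for(address: str) -> str:
    ip = ipaddress.ip_address(address)
    for name, net in SUBNETS.items():
        if ip in net:
            return name
    return "unknown / external"

if __name__ == "__main__":
    print(segment_for("192.168.2.17"))  # -> Storage Network
```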
A dedicated firewall protects the network from external threats; the Firewall Ruleset details the current configuration. Understanding VLAN Configuration is vital for network troubleshooting. DNS is handled internally by a BIND 9 server (a quick resolver check is sketched below).
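To verify that the internal resolver is answering for cluster hosts, a check along these lines can be used. The hostname shown is hypothetical; substitute the real internal names from the DNS zone.

```python
# Confirm the internal BIND 9 resolver answers for a cluster host.
# The hostname below is hypothetical; use the real internal names.
import socket

def resolve(hostname: str) -> list[str]:
    # Return all IPv4 addresses the configured resolver hands back.
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    print(resolve("ai-cn01.margate.internal"))  # hypothetical internal name
```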
System Administration
- User Accounts: Access to the servers is controlled via SSH, with user accounts managed through LDAP. See the User Account Management page.
- Backup Strategy: Regular backups of critical data are performed using BorgBackup (a backup freshness check is sketched after this list). The Backup and Recovery Procedures document details the schedule and process.
- Security Updates: Security updates are applied automatically using unattended-upgrades.
- Log Management: Logs are centralised using the ELK stack (Elasticsearch, Logstash, Kibana). See the Log Analysis Guide.
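To complement the backup schedule, a small freshness check can flag a missed run. The sketch below shells out to the borg CLI; the repository path is an assumption for illustration, and it relies on Borg 1.x reporting archive times as naive local ISO 8601 timestamps.

```python
# Check that the most recent BorgBackup archive is under 24 hours old.
import json
import subprocess
from datetime import datetime, timedelta

REPO = "/backups/ai-margate"  # hypothetical repository path

def latest_archive_age(repo: str) -> timedelta:
    out = subprocess.run(
        ["borg", "list", "--json", repo],
        check=True, capture_output=True, text=True,
    ).stdout
    archives = json.loads(out)["archives"]
    if not archives:
        raise RuntimeError("no archives found in repository")
    # Borg 1.x emits naive local timestamps such as "2024-01-05T02:00:00.000000".
    newest = max(datetime.fromisoformat(a["time"]) for a in archives)
    return datetime.now() - newest

if __name__ == "__main__":
    age = latest_archive_age(REPO)
    print(f"Most recent backup is {age} old")
    if age > timedelta(hours=24):
        raise SystemExit("WARNING: last backup is older than 24 hours")
```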
Future Considerations
We are planning to upgrade the GPU infrastructure in the Compute Nodes to NVIDIA H100 GPUs in Q1 2024. This will significantly increase our computational capacity. Further details can be found on the Future Infrastructure Roadmap.