AI in Redhill: Server Configuration Documentation
Welcome to the documentation for the "AI in Redhill" server cluster. This document provides a comprehensive overview of the hardware and software configuration that supports the system, and is intended for new system administrators and developers working with the Redhill infrastructure. Please familiarize yourself with the MediaWiki installation guide before proceeding.
Overview
The "AI in Redhill" project requires a high-performance computing (HPC) environment to facilitate the training and deployment of advanced artificial intelligence models. The cluster consists of three primary node types: Master Nodes, Compute Nodes, and Storage Nodes. Each node type is configured for a specific purpose, optimizing performance and reliability. Understanding the interplay between these nodes is crucial for effective system management. Refer to the System Architecture Overview for a broader context.
Master Nodes
The Master Nodes are responsible for cluster management, job scheduling, and overall system health monitoring. There are two Master Nodes for redundancy.
Hardware Specifications
Component | Specification |
---|---|
CPU | 2 x Intel Xeon Gold 6338 (32 cores/64 threads per CPU) |
RAM | 256 GB DDR4 ECC Registered |
Storage (OS) | 2 x 1 TB NVMe SSD (RAID 1) |
Network Interface | 2 x 100GbE Network Adapters |
Power Supply | 2 x 1600W Redundant Power Supplies |
Software Configuration
- Operating System: Ubuntu Server 22.04 LTS
- Cluster Management: Slurm Workload Manager version 23.08.8 (a minimal job-submission sketch follows this list)
- Monitoring: Prometheus with Grafana dashboards (see Monitoring Dashboard Access for details).
- Configuration Management: Ansible is used for automated configuration and deployment.
- Networking: Configured with a private network using VLANs for security and performance.
- Security: Fail2Ban is enabled for intrusion prevention.
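For administrators who need to drive the scheduler from tooling rather than an interactive shell, the following is a minimal sketch of submitting a batch job to Slurm from Python. The partition name and the batch script path are hypothetical placeholders; consult the Compute Node Allocation Policy for the actual partition names.

```python
import subprocess

def submit_job(script_path: str, partition: str, gpus: int = 0) -> str:
    """Submit a batch script via sbatch and return the assigned Slurm job ID."""
    cmd = ["sbatch", "--parsable", f"--partition={partition}"]
    if gpus > 0:
        # Request GPUs through Slurm's generic resource (GRES) mechanism.
        cmd.append(f"--gres=gpu:{gpus}")
    cmd.append(script_path)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # --parsable makes sbatch print only "jobid" (or "jobid;cluster").
    return result.stdout.strip().split(";")[0]

if __name__ == "__main__":
    # "train.sbatch" and the "compute" partition are placeholders for this example.
    print("Submitted job", submit_job("train.sbatch", partition="compute", gpus=4))
```

The same result can of course be achieved with a plain sbatch invocation from a shell; a wrapper like this is only useful when submission needs to be embedded in larger automation.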
Compute Nodes
The Compute Nodes are the workhorses of the cluster, performing the intensive computations required for AI model training and inference. We currently have 8 Compute Nodes. The Compute Node Allocation Policy outlines how these resources are assigned.
Hardware Specifications
Component | Specification |
---|---|
CPU | 2 x AMD EPYC 7763 (64 cores/128 threads per CPU) |
RAM | 512 GB DDR4 ECC Registered |
GPU | 4 x NVIDIA A100 80GB GPUs |
Storage (Local) | 4 TB NVMe SSD |
Network Interface | 2 x 100GbE Network Adapters |
Power Supply | 2 x 2000W Redundant Power Supplies |
Software Configuration
- Operating System: Ubuntu Server 22.04 LTS
- CUDA Toolkit: CUDA Toolkit 12.2
- cuDNN: cuDNN 8.9.2
- Deep Learning Frameworks: TensorFlow 2.13, PyTorch 2.0, JAX 0.4.20 (a PyTorch-based GPU visibility check follows this list)
- Containerization: Docker and Kubernetes are used for application deployment. See the Containerization Guide for more information.
- File System: Nodes mount the shared storage via NFS.
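As a quick sanity check after provisioning a Compute Node, the sketch below uses the installed PyTorch build to confirm that CUDA is functional and that all four A100 GPUs are visible. The expected count of four reflects the hardware table above.

```python
import torch

def report_gpus(expected: int = 4) -> None:
    """Print the CUDA devices visible to PyTorch and warn if fewer than expected."""
    if not torch.cuda.is_available():
        print("CUDA is not available on this node - check the driver and CUDA Toolkit installation.")
        return
    count = torch.cuda.device_count()
    for index in range(count):
        # On the Compute Nodes each device should report as an NVIDIA A100 80GB.
        print(f"GPU {index}: {torch.cuda.get_device_name(index)}")
    if count < expected:
        print(f"Warning: only {count} of {expected} expected GPUs are visible.")

if __name__ == "__main__":
    report_gpus()
```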
Storage Nodes
The Storage Nodes provide centralized, high-capacity storage for datasets, model checkpoints, and other critical data. There are two Storage Nodes configured in a high-availability cluster.
Hardware Specifications
Component | Specification |
---|---|
CPU | 2 x Intel Xeon Silver 4310 (12 cores/24 threads per CPU) |
RAM | 128 GB DDR4 ECC Registered |
Storage | 16 x 18 TB SAS HDDs (RAID 6) – Total usable capacity: ~200 TB |
Network Interface | 2 x 40GbE Network Adapters |
Power Supply | 2 x 1200W Redundant Power Supplies |
Software Configuration
- Operating System: Ubuntu Server 22.04 LTS
- File System: Ceph distributed storage system.
- NFS: An NFS server exports the shared storage to the Compute Nodes. See NFS Configuration Details.
- Backup: Regular backups are performed using Restic.
- Monitoring: Integrated with Prometheus and Grafana (a scripted health-check example follows this list).
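For a scripted health check outside of Grafana, the sketch below queries the Prometheus HTTP API for scrape targets that are currently down. The Prometheus host name used here is a placeholder; the real endpoint is documented under Monitoring Dashboard Access.

```python
import requests

# Placeholder endpoint; see Monitoring Dashboard Access for the actual host.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"

def down_targets() -> list[str]:
    """Return the instance labels of all scrape targets Prometheus reports as down."""
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "up == 0"},
        timeout=10,
    )
    response.raise_for_status()
    results = response.json()["data"]["result"]
    return [entry["metric"].get("instance", "unknown") for entry in results]

if __name__ == "__main__":
    down = down_targets()
    if down:
        print("Targets reporting down:", ", ".join(down))
    else:
        print("All monitored targets are up.")
```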
Network Topology
The cluster utilizes a leaf-spine network topology for low latency and high bandwidth. See the Network Diagram for a visual representation. Master and Compute Nodes are interconnected via 100GbE switches, while the Storage Nodes attach to the same fabric through their 40GbE adapters.
Security Considerations
Security is paramount. All network traffic is encrypted using TLS. Access to the cluster is controlled via SSH key authentication. Regular security audits are conducted. Refer to the Security Policy Document for detailed information.
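The sketch below shows one way a management script can open a key-authenticated SSH session using the paramiko library; the host name, user name, and key path are illustrative placeholders rather than values defined by this document.

```python
import os
import paramiko

def run_remote(host: str, user: str, key_path: str, command: str) -> str:
    """Run a single command over an SSH session authenticated with a private key."""
    client = paramiko.SSHClient()
    # For production use, load and verify known host keys instead of auto-adding them.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=host, username=user, key_filename=os.path.expanduser(key_path))
    try:
        _stdin, stdout, _stderr = client.exec_command(command)
        return stdout.read().decode()
    finally:
        client.close()

if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(run_remote("master01.example.internal", "admin", "~/.ssh/id_ed25519", "uptime"))
```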
Troubleshooting
For common issues and troubleshooting steps, please refer to the Troubleshooting Guide. If you encounter an unresolved problem, please contact the System Administration Team.
Intel-Based Server Configurations
Configuration | Specifications | Benchmark |
---|---|---|
Core i7-6700K/7700 Server | 64 GB DDR4, 2 x 512 GB NVMe SSD | CPU Benchmark: 8046 |
Core i7-8700 Server | 64 GB DDR4, 2 x 1 TB NVMe SSD | CPU Benchmark: 13124 |
Core i9-9900K Server | 128 GB DDR4, 2 x 1 TB NVMe SSD | CPU Benchmark: 49969 |
Core i9-13900 Server (64 GB) | 64 GB RAM, 2 x 2 TB NVMe SSD | |
Core i9-13900 Server (128 GB) | 128 GB RAM, 2 x 2 TB NVMe SSD | |
Core i5-13500 Server (64 GB) | 64 GB RAM, 2 x 500 GB NVMe SSD | |
Core i5-13500 Server (128 GB) | 128 GB RAM, 2 x 500 GB NVMe SSD | |
Core i5-13500 Workstation | 64 GB DDR5 RAM, 2 x NVMe SSD, NVIDIA RTX 4000 | |
AMD-Based Server Configurations
Configuration | Specifications | Benchmark |
---|---|---|
Ryzen 5 3600 Server | 64 GB RAM, 2 x 480 GB NVMe | CPU Benchmark: 17849 |
Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2 x 1 TB NVMe | CPU Benchmark: 35224 |
Ryzen 9 5950X Server | 128 GB RAM, 2 x 4 TB NVMe | CPU Benchmark: 46045 |
Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2 x 2 TB NVMe | CPU Benchmark: 63561 |
EPYC 7502P Server (128 GB / 1 TB) | 128 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (128 GB / 2 TB) | 128 GB RAM, 2 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (128 GB / 4 TB) | 128 GB RAM, 2 x 2 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (256 GB / 1 TB) | 256 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (256 GB / 4 TB) | 256 GB RAM, 2 x 2 TB NVMe | CPU Benchmark: 48021 |
EPYC 9454P Server | 256 GB RAM, 2 x 2 TB NVMe | |
*Note: All benchmark scores are approximate and may vary based on configuration.*