AI in the Great Rift Valley: Server Configuration
This document describes the server configuration for the “AI in the Great Rift Valley” project, covering the hardware, software, and network infrastructure that support our artificial intelligence research and data-processing initiatives. It is intended for new engineers and system administrators joining the project; understanding this configuration is essential for maintenance, troubleshooting, and future expansion.
Overview
The “AI in the Great Rift Valley” project utilizes a distributed server cluster located in a secure data center near Nairobi, Kenya. The primary goal of this infrastructure is to process and analyze large datasets collected from various sensors deployed throughout the Rift Valley, focusing on environmental monitoring, geological event prediction, and wildlife behavior analysis. The server cluster consists of compute nodes, storage nodes, and a dedicated network infrastructure to ensure high performance and data integrity. We employ a hybrid cloud approach, leveraging on-premises hardware for sensitive data and cloud resources for burst processing. This document focuses on the on-premises infrastructure.
Hardware Configuration
The core of our infrastructure comprises three primary types of servers: Compute Nodes, Storage Nodes, and a Management Node. Each server type is detailed below.
Compute Nodes
These nodes are responsible for the computationally intensive tasks of machine learning model training and inference. We currently utilize 12 compute nodes, each with the specifications outlined below.
| Specification | Value |
|---|---|
| CPU | Dual Intel Xeon Gold 6338 (32 cores / 64 threads per CPU) |
| RAM | 256 GB DDR4 ECC Registered RAM |
| GPU | 4 x NVIDIA A100 80 GB GPUs |
| Storage (Local) | 1 TB NVMe SSD (for temporary data and OS) |
| Network Interface | Dual 100 GbE Network Interface Cards (NICs) |
| Power Supply | Redundant 1600 W Platinum power supplies |
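The aggregate capacity implied by the table can be sketched with simple arithmetic; this is a back-of-envelope calculation from the figures above, not a query against the live cluster:

```shell
# Aggregate capacity across the 12 compute nodes (figures from the table above)
nodes=12
gpus_per_node=4        # NVIDIA A100 80GB
cpus_per_node=2        # dual-socket Xeon Gold 6338
cores_per_cpu=32
ram_gb_per_node=256

echo "Total GPUs:      $(( nodes * gpus_per_node ))"                   # 48
echo "Total CPU cores: $(( nodes * cpus_per_node * cores_per_cpu ))"   # 768
echo "Total RAM (GB):  $(( nodes * ram_gb_per_node ))"                 # 3072
```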
Storage Nodes
Storage nodes provide persistent storage for the raw data, processed data, and machine learning models. We have 6 storage nodes, configured for high availability and data redundancy.
| Specification | Value |
|---|---|
| CPU | Intel Xeon Silver 4310 (12 cores / 24 threads) |
| RAM | 64 GB DDR4 ECC Registered RAM |
| Storage (Total) | 6 x 16 TB SAS HDDs (RAID 6 configuration) – usable capacity: ~64 TB per node |
| Network Interface | Dual 40 GbE Network Interface Cards (NICs) |
| RAID Controller | Hardware RAID controller with battery backup |
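RAID 6 reserves two drives' worth of space for parity, so usable capacity per node works out to (6 − 2) × 16 TB = 64 TB before filesystem overhead. A quick sketch of the arithmetic:

```shell
# RAID 6 usable capacity: (drives - 2 parity drives) * drive size
drives=6
drive_tb=16
parity=2
usable_tb=$(( (drives - parity) * drive_tb ))
echo "Usable per storage node: ${usable_tb} TB"   # prints "Usable per storage node: 64 TB"
```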
Management Node
The management node is responsible for cluster monitoring, job scheduling, and overall system administration. It runs a lightweight operating system and focuses on control plane functions.
| Specification | Value |
|---|---|
| CPU | Intel Xeon E-2324G (8 cores / 16 threads) |
| RAM | 32 GB DDR4 ECC Registered RAM |
| Storage | 512 GB SATA SSD |
| Network Interface | Dual 1 GbE Network Interface Cards (NICs) |
Software Configuration
The server cluster runs a customized distribution of Ubuntu Server 22.04 LTS. Key software components include:
- Operating System: Ubuntu Server 22.04 LTS
- Cluster Management: Slurm Workload Manager is used for job scheduling and resource allocation. See the Slurm Configuration Guide for detailed instructions.
- Containerization: Docker and Kubernetes are used for application deployment and orchestration. The Kubernetes Cluster Setup document provides detailed setup instructions.
- Data Storage: Ceph is employed as the distributed file system providing scalable and resilient storage. Refer to the Ceph Administration Guide for details.
- Machine Learning Frameworks: TensorFlow, PyTorch, and Scikit-learn are the primary machine learning frameworks utilized.
- Monitoring: Prometheus and Grafana are used for system monitoring and visualization. See the Monitoring Dashboard Guide for access.
- Networking: Calico provides network policy enforcement and network connectivity within the Kubernetes cluster.
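As a concrete example of how compute jobs typically reach the cluster through Slurm, the batch script below requests one compute node's GPUs for a training run. The partition name, script, and data paths are illustrative placeholders, not values from our actual configuration; consult the Slurm Configuration Guide for the real partition and account names:

```shell
#!/bin/bash
#SBATCH --job-name=rift-train     # illustrative job name
#SBATCH --partition=gpu           # hypothetical partition; see the Slurm Configuration Guide
#SBATCH --nodes=1
#SBATCH --gres=gpu:4              # all four A100s on one compute node
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=24:00:00

# Paths below are placeholders for real dataset/model locations on Ceph.
srun python train.py --data /ceph/datasets/example --out /ceph/models/example
```

Submit with `sbatch train_job.sh` and check queue state with `squeue -u $USER`.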
Network Infrastructure
The server cluster is connected via a dedicated high-speed network. Key components include:
- Network Topology: Spine-Leaf architecture for low latency and high bandwidth.
- Switches: Arista 7050X Series switches. See the Network Diagram for a detailed overview.
- Interconnect: 100 GbE and 40 GbE connections between servers and switches.
- Firewall: pfSense firewall protects the cluster from external threats. Refer to the Firewall Configuration documentation.
- DNS: Internal BIND9 DNS server for name resolution within the cluster.
Security Considerations
Security is paramount. The following measures are in place:
- Physical Security: Data center access is restricted and monitored 24/7.
- Network Security: Firewall rules are strictly enforced, and network traffic is monitored.
- Data Encryption: Data at rest and in transit is encrypted using industry-standard encryption algorithms.
- Access Control: Role-Based Access Control (RBAC) is implemented to limit access to sensitive data and resources. See the Access Control Policy for more information.
- Regular Security Audits: Periodic security audits are conducted to identify and address vulnerabilities.
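To make the encryption point concrete, the snippet below round-trips a file through AES-256-CBC with OpenSSL. This illustrates the kind of symmetric encryption involved; it is not the cluster's actual key-management setup, and the passphrase and file paths are throwaway examples (production keys are never passed on the command line):

```shell
# Illustrative AES-256 round trip (demo passphrase only; real deployments use managed keys)
printf 'sensor reading: 42\n' > /tmp/plain.txt
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in /tmp/plain.txt -out /tmp/cipher.bin -pass pass:demo-passphrase
openssl enc -d -aes-256-cbc -pbkdf2 \
    -in /tmp/cipher.bin -pass pass:demo-passphrase
# prints: sensor reading: 42
```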
Future Expansion
We anticipate expanding the cluster in the coming months to accommodate growing data volumes and increasingly complex machine learning models. Planned upgrades include:
- Adding more compute nodes with newer generation GPUs.
- Increasing the storage capacity of the storage nodes.
- Implementing a disaster recovery site to ensure business continuity.
- Exploring the use of additional cloud resources for burst processing.
Related Documentation
- Server Room Access Procedures
- Data Backup and Recovery Policy
- Incident Response Plan
- Software Licensing Information
- Hardware Inventory