AI in the Great Rift Valley: Server Configuration
This document describes the server configuration for the “AI in the Great Rift Valley” project, covering the hardware, software, and network infrastructure that support our artificial intelligence research and data-processing initiatives. It is intended for new engineers and system administrators joining the project; understanding this configuration is essential for maintenance, troubleshooting, and future expansion.
Overview
The “AI in the Great Rift Valley” project utilizes a distributed server cluster located in a secure data center near Nairobi, Kenya. The primary goal of this infrastructure is to process and analyze large datasets collected from various sensors deployed throughout the Rift Valley, focusing on environmental monitoring, geological event prediction, and wildlife behavior analysis. The server cluster consists of compute nodes, storage nodes, and a dedicated network infrastructure to ensure high performance and data integrity. We employ a hybrid cloud approach, leveraging on-premises hardware for sensitive data and cloud resources for burst processing. This document focuses on the on-premises infrastructure.
Hardware Configuration
The core of our infrastructure comprises three primary types of servers: Compute Nodes, Storage Nodes, and a Management Node. Each server type is detailed below.
Compute Nodes
These nodes are responsible for the computationally intensive tasks of machine learning model training and inference. We currently utilize 12 compute nodes, each with the specifications outlined below.
| Specification | Value |
|---|---|
| CPU | Dual Intel Xeon Gold 6338 (32 cores / 64 threads per CPU) |
| RAM | 256 GB DDR4 ECC Registered RAM |
| GPU | 4 x NVIDIA A100 80 GB GPUs |
| Storage (Local) | 1 TB NVMe SSD (for temporary data and OS) |
| Network Interface | Dual 100 GbE Network Interface Cards (NICs) |
| Power Supply | Redundant 1600 W Platinum power supplies |
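The aggregate capacity implied by the table can be sketched with simple arithmetic; this is a back-of-envelope calculation from the figures above, not a query against the live cluster:

```shell
# Aggregate capacity across the 12 compute nodes (figures from the table above)
nodes=12
gpus_per_node=4        # NVIDIA A100 80GB
cpus_per_node=2        # dual-socket Xeon Gold 6338
cores_per_cpu=32
ram_gb_per_node=256

echo "Total GPUs:      $(( nodes * gpus_per_node ))"                   # 48
echo "Total CPU cores: $(( nodes * cpus_per_node * cores_per_cpu ))"   # 768
echo "Total RAM (GB):  $(( nodes * ram_gb_per_node ))"                 # 3072
```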
Storage Nodes
Storage nodes provide persistent storage for the raw data, processed data, and machine learning models. We have 6 storage nodes, configured for high availability and data redundancy.
| Specification | Value |
|---|---|
| CPU | Intel Xeon Silver 4310 (12 cores / 24 threads) |
| RAM | 64 GB DDR4 ECC Registered RAM |
| Storage (Total) | 6 x 16 TB SAS HDDs (RAID 6 configuration) – usable capacity: ~64 TB per node |
| Network Interface | Dual 40 GbE Network Interface Cards (NICs) |
| RAID Controller | Hardware RAID controller with battery backup |
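RAID 6 reserves two drives' worth of space for parity, so usable capacity per node works out to (6 − 2) × 16 TB = 64 TB before filesystem overhead. A quick sketch of the arithmetic:

```shell
# RAID 6 usable capacity: (drives - 2 parity drives) * drive size
drives=6
drive_tb=16
parity=2
usable_tb=$(( (drives - parity) * drive_tb ))
echo "Usable per storage node: ${usable_tb} TB"   # prints "Usable per storage node: 64 TB"
```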
Management Node
The management node is responsible for cluster monitoring, job scheduling, and overall system administration. It runs a lightweight operating system and focuses on control plane functions.
| Specification | Value |
|---|---|
| CPU | Intel Xeon E-2324G (8 cores / 16 threads) |
| RAM | 32 GB DDR4 ECC Registered RAM |
| Storage | 512 GB SATA SSD |
| Network Interface | Dual 1 GbE Network Interface Cards (NICs) |
Software Configuration
The server cluster runs a customized distribution of Ubuntu Server 22.04 LTS. Key software components include:
- Operating System: Ubuntu Server 22.04 LTS
- Cluster Management: Slurm Workload Manager is used for job scheduling and resource allocation. See the Slurm Configuration Guide for detailed instructions.
- Containerization: Docker and Kubernetes are used for application deployment and orchestration. The Kubernetes Cluster Setup document provides detailed setup instructions.
- Data Storage: Ceph is employed as the distributed file system providing scalable and resilient storage. Refer to the Ceph Administration Guide for details.
- Machine Learning Frameworks: TensorFlow, PyTorch, and Scikit-learn are the primary machine learning frameworks utilized.
- Monitoring: Prometheus and Grafana are used for system monitoring and visualization. See the Monitoring Dashboard Guide for access.
- Networking: Calico provides network policy enforcement and network connectivity within the Kubernetes cluster.
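As a concrete example of how compute jobs typically reach the cluster through Slurm, the batch script below requests one compute node's GPUs for a training run. The partition name, script, and data paths are illustrative placeholders, not values from our actual configuration; consult the Slurm Configuration Guide for the real partition and account names:

```shell
#!/bin/bash
#SBATCH --job-name=rift-train     # illustrative job name
#SBATCH --partition=gpu           # hypothetical partition; see the Slurm Configuration Guide
#SBATCH --nodes=1
#SBATCH --gres=gpu:4              # all four A100s on one compute node
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=24:00:00

# Paths below are placeholders for real dataset/model locations on Ceph.
srun python train.py --data /ceph/datasets/example --out /ceph/models/example
```

Submit with `sbatch train_job.sh` and check queue state with `squeue -u $USER`.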
Network Infrastructure
The server cluster is connected via a dedicated high-speed network. Key components include:
- Network Topology: Spine-Leaf architecture for low latency and high bandwidth.
- Switches: Arista 7050X Series switches. See the Network Diagram for a detailed overview.
- Interconnect: 100 GbE and 40 GbE connections between servers and switches.
- Firewall: pfSense firewall protects the cluster from external threats. Refer to the Firewall Configuration documentation.
- DNS: Internal BIND9 DNS server for name resolution within the cluster.
Security Considerations
Security is paramount. The following measures are in place:
- Physical Security: Data center access is restricted and monitored 24/7.
- Network Security: Firewall rules are strictly enforced, and network traffic is monitored.
- Data Encryption: Data at rest and in transit is encrypted using industry-standard encryption algorithms.
- Access Control: Role-Based Access Control (RBAC) is implemented to limit access to sensitive data and resources. See the Access Control Policy for more information.
- Regular Security Audits: Periodic security audits are conducted to identify and address vulnerabilities.
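To make the encryption point concrete, the snippet below round-trips a file through AES-256-CBC with OpenSSL. This illustrates the kind of symmetric encryption involved; it is not the cluster's actual key-management setup, and the passphrase and file paths are throwaway examples (production keys are never passed on the command line):

```shell
# Illustrative AES-256 round trip (demo passphrase only; real deployments use managed keys)
printf 'sensor reading: 42\n' > /tmp/plain.txt
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in /tmp/plain.txt -out /tmp/cipher.bin -pass pass:demo-passphrase
openssl enc -d -aes-256-cbc -pbkdf2 \
    -in /tmp/cipher.bin -pass pass:demo-passphrase
# prints: sensor reading: 42
```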
Future Expansion
We anticipate expanding the cluster in the coming months to accommodate growing data volumes and increasingly complex machine learning models. Planned upgrades include:
- Adding more compute nodes with newer generation GPUs.
- Increasing the storage capacity of the storage nodes.
- Implementing a disaster recovery site to ensure business continuity.
- Exploring the use of additional cloud resources for burst processing.
Related Documentation
- Server Room Access Procedures
- Data Backup and Recovery Policy
- Incident Response Plan
- Software Licensing Information
- Hardware Inventory