AI in Leatherhead: Server Configuration Documentation
Welcome to the documentation for the “AI in Leatherhead” server cluster. This document details the hardware and software configuration of our dedicated AI processing environment, intended for users new to the system. This guide will cover the core components, network setup, and software stack. Please read carefully before attempting any modifications or deployments. This server cluster supports a variety of machine learning tasks, including Natural Language Processing, Computer Vision, and Predictive Analytics.
Overview
The “AI in Leatherhead” cluster is a high-performance computing (HPC) environment designed to accelerate AI and machine learning workflows. It comprises multiple interconnected servers, a dedicated network, and a shared storage system. The primary goal of this setup is to provide a scalable and reliable platform for researchers and developers. The system leverages GPU acceleration for computationally intensive tasks.
Hardware Configuration
The cluster consists of three primary server types: Master Nodes, Compute Nodes, and Storage Nodes. Details of each are provided below.
Master Nodes
The Master Nodes manage the cluster, schedule jobs, and monitor resource utilization. We currently have two Master Nodes for redundancy.
| Specification | Value |
|---|---|
| CPU | Dual Intel Xeon Gold 6338 |
| RAM | 256 GB DDR4 ECC Registered |
| Storage (OS) | 1 TB NVMe SSD |
| Network Interface | Dual 100 Gbps InfiniBand |
Compute Nodes
The Compute Nodes perform the actual AI/ML computations. We have eight Compute Nodes currently deployed.
| Specification | Value |
|---|---|
| CPU | Dual AMD EPYC 7763 |
| RAM | 512 GB DDR4 ECC Registered |
| GPU | 4x NVIDIA A100 (80 GB) |
| Storage (Local) | 2 TB NVMe SSD (for temporary data) |
| Network Interface | Dual 200 Gbps InfiniBand |
Storage Nodes
The Storage Nodes provide a shared file system accessible to all nodes in the cluster. This is critical for data-intensive AI workloads.
| Specification | Value |
|---|---|
| CPU | Intel Xeon Silver 4310 |
| RAM | 128 GB DDR4 ECC Registered |
| Storage (Raw) | 2 PB NVMe-oF Array (Redundant) |
| Network Interface | Dual 100 Gbps Ethernet |
Network Configuration
The cluster utilizes a dedicated network to minimize latency and maximize bandwidth. The network is segmented into three main parts: Management, InfiniBand, and Ethernet.
- Management Network: Used for SSH access, monitoring, and general administration. Utilizes a dedicated VLAN.
- InfiniBand Network: Used for inter-node communication during job execution. Provides high-bandwidth, low-latency connectivity between Compute Nodes and Master Nodes. We use RDMA over InfiniBand.
- Ethernet Network: Used for access to external networks and storage. Connects Storage Nodes to the Compute and Master Nodes. Uses NFS for file sharing.
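As a concrete illustration of the NFS setup above, the following is a minimal sketch of how a node might mount the shared export; the Storage Node hostname (storage01), export path (/export/shared), and mount point (/mnt/shared) are assumptions for illustration only, not the cluster's actual values.

```bash
# Minimal sketch: mount the shared NFS export on a node.
# "storage01", "/export/shared", and "/mnt/shared" are illustrative assumptions,
# not the cluster's actual hostnames or paths.
sudo mkdir -p /mnt/shared
sudo mount -t nfs -o rw,hard,vers=4.2 storage01:/export/shared /mnt/shared
```

In practice, such a mount would normally be defined in /etc/fstab or applied by configuration management rather than run by hand.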
The network is managed by a dedicated network administrator who ensures optimal performance and security. Firewall rules are in place to protect the cluster from unauthorized access.
Software Stack
The “AI in Leatherhead” cluster runs a standard Linux distribution with several key software packages.
- Operating System: Ubuntu Server 22.04 LTS
- Resource Manager: Slurm Workload Manager is used for job scheduling and resource allocation.
- Containerization: Docker and Kubernetes are used for containerized deployments of AI/ML applications.
- Machine Learning Frameworks: TensorFlow, PyTorch, and scikit-learn are pre-installed and configured (a quick GPU-visibility check is sketched after this list).
- Programming Languages: Python 3.9 is the primary programming language. R is also available.
- Data Storage: A Lustre file system is mounted on all nodes, providing a shared, high-performance storage solution.
- Monitoring: Prometheus and Grafana provide real-time monitoring of cluster health and performance.
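To confirm that the frameworks listed above can see a Compute Node's GPUs, a short check such as the following can be run. This is a minimal sketch that assumes only the pre-installed, CUDA-enabled PyTorch build; the script itself is not part of the cluster software.

```python
# Minimal sketch: verify GPU visibility on a Compute Node using the pre-installed PyTorch.
import torch

def report_gpus() -> None:
    if not torch.cuda.is_available():
        print("CUDA is not available on this node.")
        return
    count = torch.cuda.device_count()
    print(f"Visible GPUs: {count}")  # a Compute Node should report 4 (4x A100)
    for i in range(count):
        props = torch.cuda.get_device_properties(i)
        print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")

if __name__ == "__main__":
    report_gpus()
```

On a Compute Node this should report four A100 devices; a lower count usually means the job was not allocated all of the node's GPUs.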
Accessing the Cluster
Access to the cluster is granted through SSH. Users must have a valid account and adhere to the acceptable use policy. Job submissions are handled via the `sbatch` command provided by Slurm. Detailed instructions on using Slurm can be found on the Slurm documentation page.
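For illustration, a job script along the following lines could be submitted with `sbatch`. The partition name, resource requests, and `train.py` are placeholder assumptions; the correct values for this cluster should be taken from the Slurm documentation referenced above.

```bash
#!/bin/bash
# Illustrative Slurm batch script; the partition name, resource sizes, and
# train.py are placeholder assumptions, not cluster-specific defaults.
#SBATCH --job-name=train-example
#SBATCH --partition=gpu          # assumed partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:1             # request one of the node's GPUs
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=02:00:00
#SBATCH --output=%x-%j.out

srun python3 train.py            # train.py is a placeholder user script
```

A script like this would be submitted with `sbatch <script>` and monitored with `squeue -u $USER`.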
Future Enhancements
We are planning to upgrade the cluster with newer GPUs and expand the storage capacity. We also intend to integrate a model registry for managing and versioning machine learning models. Further improvements to data pipelines are also planned.
Intel-Based Server Configurations
| Configuration | Specifications | Benchmark |
|---|---|---|
| Core i7-6700K/7700 Server | 64 GB DDR4, 2x512 GB NVMe SSD | CPU Benchmark: 8046 |
| Core i7-8700 Server | 64 GB DDR4, 2x1 TB NVMe SSD | CPU Benchmark: 13124 |
| Core i9-9900K Server | 128 GB DDR4, 2x1 TB NVMe SSD | CPU Benchmark: 49969 |
| Core i9-13900 Server (64GB) | 64 GB RAM, 2x2 TB NVMe SSD | |
| Core i9-13900 Server (128GB) | 128 GB RAM, 2x2 TB NVMe SSD | |
| Core i5-13500 Server (64GB) | 64 GB RAM, 2x500 GB NVMe SSD | |
| Core i5-13500 Server (128GB) | 128 GB RAM, 2x500 GB NVMe SSD | |
| Core i5-13500 Workstation | 64 GB DDR5 RAM, 2x NVMe SSD, NVIDIA RTX 4000 | |
AMD-Based Server Configurations
| Configuration | Specifications | Benchmark |
|---|---|---|
| Ryzen 5 3600 Server | 64 GB RAM, 2x480 GB NVMe | CPU Benchmark: 17849 |
| Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2x1 TB NVMe | CPU Benchmark: 35224 |
| Ryzen 9 5950X Server | 128 GB RAM, 2x4 TB NVMe | CPU Benchmark: 46045 |
| Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2x2 TB NVMe | CPU Benchmark: 63561 |
| EPYC 7502P Server (128GB/1TB) | 128 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (128GB/2TB) | 128 GB RAM, 2 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (128GB/4TB) | 128 GB RAM, 2x2 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (256GB/1TB) | 256 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (256GB/4TB) | 256 GB RAM, 2x2 TB NVMe | CPU Benchmark: 48021 |
| EPYC 9454P Server | 256 GB RAM, 2x2 TB NVMe | |
*Note: All benchmark scores are approximate and may vary based on configuration.*