AI in Leatherhead

AI in Leatherhead: Server Configuration Documentation

Welcome to the documentation for the “AI in Leatherhead” server cluster. This document details the hardware and software configuration of our dedicated AI processing environment and is intended for users new to the system. It covers the core components, network setup, and software stack; please read it carefully before attempting any modifications or deployments. The cluster supports a variety of machine learning tasks, including Natural Language Processing, Computer Vision, and Predictive Analytics.

Overview

The “AI in Leatherhead” cluster is a high-performance computing (HPC) environment designed to accelerate AI and machine learning workflows. It comprises multiple interconnected servers, a dedicated network, and a shared storage system. The primary goal of this setup is to provide a scalable and reliable platform for researchers and developers. The system leverages GPU acceleration for computationally intensive tasks.

Hardware Configuration

The cluster consists of three primary server types: Master Nodes, Compute Nodes, and Storage Nodes. Details of each are provided below.

Master Nodes

The Master Nodes manage the cluster, schedule jobs, and monitor resource utilization. We currently have two Master Nodes for redundancy.

  • CPU: Dual Intel Xeon Gold 6338
  • RAM: 256 GB DDR4 ECC Registered
  • Storage (OS): 1 TB NVMe SSD
  • Network Interface: Dual 100 Gbps InfiniBand

Compute Nodes

The Compute Nodes perform the actual AI/ML computations. We have eight Compute Nodes currently deployed.

  • CPU: Dual AMD EPYC 7763
  • RAM: 512 GB DDR4 ECC Registered
  • GPU: 4x NVIDIA A100 (80 GB)
  • Storage (Local): 2 TB NVMe SSD (for temporary data)
  • Network Interface: Dual 200 Gbps InfiniBand
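
After logging in to a Compute Node (or inside an interactive Slurm allocation), users can confirm that the four A100 GPUs are visible with a short check against the pre-installed PyTorch build. The following is a minimal sketch, assuming a CUDA-enabled PyTorch install; the exact device names and memory figures reported depend on the driver and PyTorch versions.

```python
# gpu_check.py - minimal sanity check for GPU visibility on a Compute Node.
# Assumes the pre-installed, CUDA-enabled PyTorch build.
import torch

if not torch.cuda.is_available():
    raise SystemExit("CUDA is not available - check drivers or your allocation.")

print(f"Visible GPUs: {torch.cuda.device_count()}")  # expect 4 per Compute Node
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # total_memory is reported in bytes; an 80 GB A100 shows roughly 80 GiB.
    print(f"  cuda:{i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")
```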

Storage Nodes

The Storage Nodes provide a shared file system accessible to all nodes in the cluster. This is critical for data-intensive AI workloads.

  • CPU: Intel Xeon Silver 4310
  • RAM: 128 GB DDR4 ECC Registered
  • Storage (Raw): 2 PB NVMe-oF Array (Redundant)
  • Network Interface: Dual 100 Gbps Ethernet

Network Configuration

The cluster utilizes a dedicated network to minimize latency and maximize bandwidth. The network is segmented into three main parts: Management, InfiniBand, and Ethernet.

  • Management Network: Used for SSH access, monitoring, and general administration. Utilizes a dedicated VLAN.
  • InfiniBand Network: Used for inter-node communication during job execution. Provides high-bandwidth, low-latency connectivity between Compute Nodes and Master Nodes. We use RDMA over InfiniBand; a quick port check is sketched after this list.
  • Ethernet Network: Used for access to external networks and storage. Connects Storage Nodes to the Compute and Master Nodes. Uses NFS for file sharing.
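
On Linux hosts with the InfiniBand drivers loaded, the kernel exposes each adapter under /sys/class/infiniband. The sketch below reads that sysfs tree to list adapters and their port states so users can confirm RDMA-capable links before launching multi-node jobs; the exact paths are an assumption about this cluster's standard Linux IB driver stack and may differ on other setups.

```python
# ib_check.py - list InfiniBand adapters and port states via sysfs.
# Assumes the standard Linux InfiniBand driver layout under /sys/class/infiniband.
from pathlib import Path

ib_root = Path("/sys/class/infiniband")
if not ib_root.exists():
    raise SystemExit("No InfiniBand devices found (is the IB driver loaded?)")

for hca in sorted(ib_root.iterdir()):
    for port in sorted((hca / "ports").iterdir()):
        state = (port / "state").read_text().strip()  # e.g. "4: ACTIVE"
        rate = (port / "rate").read_text().strip()    # e.g. "200 Gb/sec (4X HDR)"
        print(f"{hca.name} port {port.name}: {state}, {rate}")
```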

The network is managed by a dedicated network administrator who ensures optimal performance and security. Firewall rules are in place to protect the cluster from unauthorized access.

Software Stack

The “AI in Leatherhead” cluster runs a standard Linux distribution with several key software packages.

  • Operating System: Ubuntu Server 22.04 LTS
  • Resource Manager: Slurm Workload Manager is used for job scheduling and resource allocation.
  • Containerization: Docker and Kubernetes are used for containerized deployments of AI/ML applications.
  • Machine Learning Frameworks: TensorFlow, PyTorch, and scikit-learn are pre-installed and configured; a quick import check is sketched after this list.
  • Programming Languages: Python 3.9 is the primary programming language. R is also available.
  • Data Storage: The Lustre file system is mounted on all nodes, providing a shared, high-performance storage solution.
  • Monitoring: Prometheus and Grafana provide real-time monitoring of cluster health and performance.
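
A new account can confirm that the software stack is wired up correctly with a one-off version check. The following minimal sketch imports each pre-installed framework and prints its version; the versions reported will depend on the images currently deployed on the cluster.

```python
# stack_check.py - confirm the pre-installed ML frameworks import cleanly.
import sys

print(f"Python {sys.version.split()[0]}")  # expected: 3.9.x on this cluster

import tensorflow as tf
import torch
import sklearn

print(f"TensorFlow   {tf.__version__}")
print(f"PyTorch      {torch.__version__}")
print(f"scikit-learn {sklearn.__version__}")
```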

Accessing the Cluster

Access to the cluster is granted through SSH. Users must have a valid account and adhere to the acceptable use policy. Job submissions are handled via the `sbatch` command provided by Slurm. Detailed instructions on using Slurm can be found on the Slurm documentation page.
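
A Slurm job script is an ordinary executable whose leading `#SBATCH` comment lines carry the resource request, so it can be written directly in Python. The sketch below is a minimal single-node, two-GPU example; the partition name `gpu` is a placeholder, and the real partition names for this cluster should be taken from `sinfo`.

```python
#!/usr/bin/env python3
#SBATCH --job-name=demo            # job name shown in the queue
#SBATCH --partition=gpu            # placeholder; list real partitions with `sinfo`
#SBATCH --nodes=1
#SBATCH --gres=gpu:2               # request 2 of the node's 4 A100s
#SBATCH --time=00:10:00            # wall-clock limit (HH:MM:SS)
#SBATCH --output=demo_%j.log       # %j expands to the job ID

# Slurm reads the #SBATCH directives above; the rest runs as a normal script.
import os
import socket

print(f"Job {os.environ.get('SLURM_JOB_ID')} on {socket.gethostname()}")
# CUDA_VISIBLE_DEVICES is typically set by Slurm's GPU plugin for --gres jobs.
print(f"GPUs assigned: {os.environ.get('CUDA_VISIBLE_DEVICES')}")
```

Submit the script with `sbatch demo.py` and check its status with `squeue -u $USER`.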

Future Enhancements

We are planning to upgrade the cluster with newer GPUs and expand the storage capacity. We also intend to integrate a model registry for managing and versioning machine learning models. Further improvements to data pipelines are also planned.

Intel-Based Server Configurations

  • Core i7-6700K/7700 Server: 64 GB DDR4, 2x512 GB NVMe SSD. CPU Benchmark: 8046
  • Core i7-8700 Server: 64 GB DDR4, 2x1 TB NVMe SSD. CPU Benchmark: 13124
  • Core i9-9900K Server: 128 GB DDR4, 2x1 TB NVMe SSD. CPU Benchmark: 49969
  • Core i9-13900 Server (64GB): 64 GB RAM, 2x2 TB NVMe SSD
  • Core i9-13900 Server (128GB): 128 GB RAM, 2x2 TB NVMe SSD
  • Core i5-13500 Server (64GB): 64 GB RAM, 2x500 GB NVMe SSD
  • Core i5-13500 Server (128GB): 128 GB RAM, 2x500 GB NVMe SSD
  • Core i5-13500 Workstation: 64 GB DDR5 RAM, 2x NVMe SSD, NVIDIA RTX 4000

AMD-Based Server Configurations

  • Ryzen 5 3600 Server: 64 GB RAM, 2x480 GB NVMe SSD. CPU Benchmark: 17849
  • Ryzen 7 7700 Server: 64 GB DDR5 RAM, 2x1 TB NVMe SSD. CPU Benchmark: 35224
  • Ryzen 9 5950X Server: 128 GB RAM, 2x4 TB NVMe SSD. CPU Benchmark: 46045
  • Ryzen 9 7950X Server: 128 GB DDR5 ECC, 2x2 TB NVMe SSD. CPU Benchmark: 63561
  • EPYC 7502P Server (128GB/1TB): 128 GB RAM, 1 TB NVMe SSD. CPU Benchmark: 48021
  • EPYC 7502P Server (128GB/2TB): 128 GB RAM, 2 TB NVMe SSD. CPU Benchmark: 48021
  • EPYC 7502P Server (128GB/4TB): 128 GB RAM, 2x2 TB NVMe SSD. CPU Benchmark: 48021
  • EPYC 7502P Server (256GB/1TB): 256 GB RAM, 1 TB NVMe SSD. CPU Benchmark: 48021
  • EPYC 7502P Server (256GB/4TB): 256 GB RAM, 2x2 TB NVMe SSD. CPU Benchmark: 48021
  • EPYC 9454P Server: 256 GB RAM, 2x2 TB NVMe SSD

Order Your Dedicated Server

Configure and order your ideal server configuration


Note: All benchmark scores are approximate and may vary based on configuration. Server availability is subject to stock.