AI in Ewell: Server Configuration Documentation

Welcome to the documentation for the "AI in Ewell" server cluster. This document details the hardware and software configuration for this system, intended for new administrators and those seeking to understand the infrastructure supporting our artificial intelligence initiatives. This cluster is dedicated to running large language models and machine learning workloads, and is critical to the ongoing research conducted in Ewell.

Overview

The "AI in Ewell" server cluster is a distributed network of high-performance servers designed for parallel processing. Its primary goal is to provide the computational power needed to train and deploy complex AI models. The system is housed in a dedicated, climate-controlled server room within the Ewell facility. Redundancy and scalability are the key design principles: they keep the cluster highly available and let it adapt to future demand. Regular system backups are performed.

Hardware Configuration

The cluster consists of four primary node types: Master Nodes, Worker Nodes, Storage Nodes, and Network Nodes. Each node type is specifically configured to optimize its respective function. Below are detailed specifications for each.

Master Nodes

Master nodes are responsible for job scheduling, resource management, and overall cluster coordination. They are equipped with powerful processors and ample RAM to handle the overhead of these tasks.

| Component | Specification |
| --- | --- |
| CPU | 2 x Intel Xeon Gold 6338 (32 cores/64 threads per CPU) |
| RAM | 256 GB DDR4 ECC Registered 3200MHz |
| Storage (OS) | 1 TB NVMe SSD |
| Network Interface | 2 x 100GbE Network Adapters |
| Power Supply | 2 x 1600W Redundant Power Supplies |

Worker Nodes

Worker nodes perform the actual computational work of training and running AI models. They are equipped with high-end GPUs and a large amount of RAM.

| Component | Specification |
| --- | --- |
| CPU | 2 x AMD EPYC 7763 (64 cores/128 threads per CPU) |
| RAM | 512 GB DDR4 ECC Registered 3200MHz |
| GPU | 8 x NVIDIA A100 (80GB HBM2e) |
| Storage (Local) | 2 x 4 TB NVMe SSD (RAID 0) |
| Network Interface | 2 x 100GbE Network Adapters |
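
The per-node totals implied by the worker specification can be worked out with a short sketch like the one below. The `WorkerNode` class and its field names are hypothetical and exist only for this illustration; the numbers are taken from the specification above. It assumes RAID 0 simply sums the capacity of its member drives (striping, no redundancy) and that the two network adapters can be used together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerNode:
    """Hypothetical model of one worker node, filled in from the spec table."""
    cpu_sockets: int = 2
    cores_per_cpu: int = 64
    threads_per_cpu: int = 128
    ram_gb: int = 512
    gpus: int = 8
    gpu_mem_gb: int = 80
    nvme_drives: int = 2
    nvme_tb: int = 4
    nics: int = 2
    nic_gbps: int = 100

    @property
    def total_cores(self) -> int:
        return self.cpu_sockets * self.cores_per_cpu

    @property
    def total_threads(self) -> int:
        return self.cpu_sockets * self.threads_per_cpu

    @property
    def total_gpu_mem_gb(self) -> int:
        return self.gpus * self.gpu_mem_gb

    @property
    def scratch_tb(self) -> int:
        # RAID 0 stripes across all drives: full capacity, no redundancy.
        return self.nvme_drives * self.nvme_tb

    @property
    def network_gbps(self) -> int:
        # Upper bound, assuming both adapters carry traffic at once.
        return self.nics * self.nic_gbps


node = WorkerNode()
print(f"CPU cores:   {node.total_cores}")
print(f"CPU threads: {node.total_threads}")
print(f"GPU memory:  {node.total_gpu_mem_gb} GB")
print(f"Scratch:     {node.scratch_tb} TB (RAID 0)")
print(f"Network:     {node.network_gbps} Gb/s")
```

Two points follow from the arithmetic. Aggregate GPU memory (640 GB) exceeds host RAM (512 GB). And because the local scratch volume is RAID 0, losing either NVMe drive loses the whole volume, so it is best treated as temporary space.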

Storage Nodes

Storage nodes provide the persistent storage for datasets, models, and other important data. They utilize a distributed file system for high availability and scalability. See Storage Architecture for details.

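| Component | Specification |
| --- | --- |
| CPU | 2 x Intel Xeon Silver 4310 (12 cores/24 threads per CPU) |
| RAM | 128 GB DDR4 ECC Registered 3200MHz |
| Storage | 64 x 16 TB SAS HDD (RAID 6): 1,024 TB (about 1 PB) raw; usable capacity is lower after RAID 6 parity, at most 992 TB if all drives form a single array |
| Network Interface | 2 x 40GbE Network Adapters |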
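
The storage capacity figure is worth sanity-checking, because RAID 6 reserves two drives' worth of space per array for parity. The sketch below does that arithmetic for 64 x 16 TB drives. The source does not say how the drives are grouped into arrays, so the `groups` parameter and the four-array layout are assumptions for illustration; the helper name `raid6_usable_tb` is hypothetical. Filesystem overhead and the TB versus TiB difference are ignored.

```python
def raid6_usable_tb(drives: int, drive_tb: float, groups: int = 1) -> float:
    """Usable TB when `drives` are split evenly into `groups` RAID 6 arrays."""
    if groups < 1 or drives % groups != 0:
        raise ValueError("drives must divide evenly into a positive number of groups")
    per_group = drives // groups
    if per_group < 4:
        raise ValueError("RAID 6 needs at least 4 drives per array")
    # Each RAID 6 array gives up two drives' worth of capacity to parity.
    return groups * (per_group - 2) * drive_tb


raw_tb = 64 * 16                                      # capacity before parity
single_array = raid6_usable_tb(64, 16)                # one 64-drive array (assumed)
four_arrays = raid6_usable_tb(64, 16, groups=4)       # four 16-drive arrays (assumed)

print(f"Raw:                  {raw_tb} TB")
print(f"One 64-drive array:   {single_array} TB usable")
print(f"Four 16-drive arrays: {four_arrays} TB usable")
```

Under these assumptions the raw capacity is 1,024 TB, while usable capacity is 992 TB for a single array and 896 TB for four arrays. The 1 PB figure therefore matches raw capacity rather than usable capacity.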

Software Configuration

The cluster runs a customized Linux distribution based on Ubuntu Server 22.04. The following software components are essential to the operation of the AI in Ewell cluster.
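
A new administrator may want to confirm that a node reports the expected base release. The sketch below parses text in the `/etc/os-release` format and checks for Ubuntu 22.04. It assumes the customized image keeps Ubuntu's standard `ID` and `VERSION_ID` fields, which the source does not confirm; the function names and the sample text are hypothetical.

```python
def parse_os_release(text: str) -> dict[str, str]:
    """Parse KEY=value lines in the os-release format into a dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip("\"'")
    return fields


def is_ubuntu_2204(fields: dict[str, str]) -> bool:
    """True if the fields describe Ubuntu 22.04 (assumes standard ID fields)."""
    return fields.get("ID") == "ubuntu" and fields.get("VERSION_ID") == "22.04"


# Sample content for illustration; on a node, read the real file instead:
#   text = pathlib.Path("/etc/os-release").read_text()
sample = '''
NAME="Ubuntu"
VERSION_ID="22.04"
ID=ubuntu
ID_LIKE=debian
'''

fields = parse_os_release(sample)
print(fields["NAME"], fields["VERSION_ID"], is_ubuntu_2204(fields))
```

If the customized distribution changes `ID`, checking `ID_LIKE` may be a better fit.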
