AI in Dover: Server Configuration Documentation

Welcome to the documentation for the "AI in Dover" server configuration. This article provides a detailed overview of the hardware, software, and networking components that make up this system, intended for newcomers to our MediaWiki site and those involved in system administration. This server cluster is dedicated to running advanced Artificial Intelligence workloads, with a focus on natural language processing and machine learning tasks. Understanding the configuration is crucial for effective maintenance, troubleshooting, and future expansion.

Overview

The "AI in Dover" project utilizes a distributed server cluster located in our Dover data center. The primary goal is to provide a robust and scalable platform for AI research and development. The cluster consists of several interconnected servers, each with specialized hardware for accelerated computing. The system is designed for high throughput and low latency, essential for demanding AI applications. This documentation details the specifications of the core components. See also Dover Data Center Standards for general facility information.

Hardware Specifications

The cluster is built around three primary node types: Master Nodes, Worker Nodes, and Storage Nodes. Each node type is configured with specific hardware to optimize its role.

Master Nodes

Master Nodes are responsible for job scheduling, resource management, and overall cluster coordination. Two Master Nodes are deployed for redundancy.

CPU: Dual Intel Xeon Gold 6338
RAM: 512 GB DDR4 ECC Registered
Storage (OS): 1 TB NVMe SSD
Network Interface: Dual 100 GbE
Power Supply: Redundant 1600W Platinum

These nodes run the cluster management software – currently Kubernetes – and provide a central point of control. Refer to Kubernetes Documentation for more details.
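As a minimal illustration of how the control plane is used, workloads are submitted to the Master Nodes as Kubernetes manifests. The sketch below shows the general shape of such a manifest; the object name, labels, and container image are placeholders, not part of the cluster's actual configuration:

```yaml
# Hypothetical example - names and image are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-inference          # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-inference
  template:
    metadata:
      labels:
        app: example-inference
    spec:
      containers:
      - name: worker
        image: registry.example.com/ai/inference:latest  # placeholder image
```

The control plane on the Master Nodes accepts the manifest, then schedules the resulting pods onto Worker Nodes with available capacity.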

Worker Nodes

Worker Nodes perform the actual AI computations. These nodes are equipped with high-performance GPUs. We currently have 16 Worker Nodes.

CPU: Dual AMD EPYC 7763
RAM: 256 GB DDR4 ECC Registered
GPU: 4 x NVIDIA A100 80GB
Storage (Local): 2 TB NVMe SSD (for temporary data)
Network Interface: Dual 100 GbE
Power Supply: Redundant 2000W Titanium

The GPUs provide the necessary processing power for training and inference tasks. See GPU Driver Installation Guide for driver details. These nodes are configured with Docker for containerization of AI workloads.
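To make the scale concrete: with 16 Worker Nodes of 4 GPUs each, the cluster exposes 64 GPU slots. The sketch below assigns jobs to slots round-robin purely for illustration; in practice, placement is handled by the Kubernetes scheduler, and the node names here are hypothetical:

```python
from itertools import cycle

# Cluster shape from the table above: 16 Worker Nodes, 4 x A100 per node.
NUM_WORKERS = 16
GPUS_PER_NODE = 4

# Enumerate every (node, gpu) slot in the cluster; node names are illustrative.
slots = [(f"worker-{n:02d}", gpu)
         for n in range(NUM_WORKERS)
         for gpu in range(GPUS_PER_NODE)]

def assign_round_robin(jobs, slots):
    """Pair each job with a GPU slot, wrapping around if jobs outnumber slots."""
    slot_cycle = cycle(slots)
    return {job: next(slot_cycle) for job in jobs}

jobs = [f"train-job-{i}" for i in range(6)]
placement = assign_round_robin(jobs, slots)
```

This kind of back-of-the-envelope enumeration is useful when sizing a training run against the 64 available GPU slots.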

Storage Nodes

Storage Nodes provide persistent storage for datasets and model checkpoints. We currently use 4 Storage Nodes.

CPU: Intel Xeon Silver 4310
RAM: 128 GB DDR4 ECC Registered
Storage: 16 x 18 TB SAS HDDs (RAID 6)
Network Interface: Dual 25 GbE
Power Supply: Redundant 1200W Gold

The storage is exported to the cluster over NFS (Network File System). Consult the NFS Configuration Guide for more information. Data backup procedures are detailed in Data Backup Policy.
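The usable capacity of each array can be estimated from the table above: RAID 6 reserves the equivalent of two drives per array for parity, so a 16 x 18 TB array yields roughly 14 x 18 = 252 TB before file-system overhead. A quick sketch of that arithmetic (figures are nominal vendor terabytes, not TiB, and ignore formatting overhead):

```python
# Per the Storage Node table: 16 drives of 18 TB each in RAID 6, 4 nodes total.
DRIVES_PER_NODE = 16
DRIVE_TB = 18
STORAGE_NODES = 4
RAID6_PARITY_DRIVES = 2  # RAID 6 tolerates two simultaneous drive failures

# Usable capacity excludes the two parity-equivalent drives per array.
usable_per_node_tb = (DRIVES_PER_NODE - RAID6_PARITY_DRIVES) * DRIVE_TB
cluster_usable_tb = usable_per_node_tb * STORAGE_NODES

print(usable_per_node_tb)  # 252
print(cluster_usable_tb)   # 1008
```

That puts the cluster's nominal usable capacity on the order of 1 PB across the four Storage Nodes, which is the figure to keep in mind when planning dataset and checkpoint retention.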

Software Stack

The "AI in Dover" cluster utilizes a comprehensive software stack to facilitate AI development and deployment.