AI in Dover

From Server rental store

AI in Dover: Server Configuration Documentation

Welcome to the documentation for the "AI in Dover" server configuration. This article provides a detailed overview of the hardware, software, and networking components that comprise this system, intended for newcomers to our MediaWiki site and those involved in system administration. This server cluster is dedicated to running advanced Artificial Intelligence workloads, specifically focusing on natural language processing and machine learning tasks. Understanding the configuration is crucial for effective maintenance, troubleshooting, and future expansion.

Overview

The "AI in Dover" project utilizes a distributed server cluster located in our Dover data center. The primary goal is to provide a robust and scalable platform for AI research and development. The cluster consists of several interconnected servers, each with specialized hardware for accelerated computing. The system is designed for high throughput and low latency, essential for demanding AI applications. This documentation details the specifications of the core components. See also Dover Data Center Standards for general facility information.

Hardware Specifications

The cluster is built around three primary node types: Master Nodes, Worker Nodes, and Storage Nodes. Each node type is configured with specific hardware to optimize its role.

Master Nodes

Master Nodes are responsible for job scheduling, resource management, and overall cluster coordination. Two Master Nodes are deployed for redundancy.

Specification       Value
CPU                 Dual Intel Xeon Gold 6338
RAM                 512 GB DDR4 ECC Registered
Storage (OS)        1 TB NVMe SSD
Network Interface   Dual 100 GbE
Power Supply        Redundant 1600W Platinum

These nodes run the cluster management software (currently Kubernetes) and provide a central point of control. Refer to the Kubernetes Documentation for more details.

Worker Nodes

Worker Nodes perform the actual AI computations. These nodes are equipped with high-performance GPUs. We currently have 16 Worker Nodes.

Specification       Value
CPU                 Dual AMD EPYC 7763
RAM                 256 GB DDR4 ECC Registered
GPU                 4 x NVIDIA A100 80 GB
Storage (Local)     2 TB NVMe SSD (for temporary data)
Network Interface   Dual 100 GbE
Power Supply        Redundant 2000W Titanium

The GPUs provide the processing power required for training and inference tasks. See the GPU Driver Installation Guide for setup details. These nodes run Docker to containerize AI workloads.
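As a quick sanity check of the fleet's aggregate GPU resources (a sketch based only on the figures in the table above, not part of any deployment tooling), the totals can be computed directly:

```python
# Aggregate GPU resources across the Worker Nodes, using the figures
# from the Worker Node table: 16 nodes, each with 4 x NVIDIA A100 80 GB.
WORKER_NODES = 16
GPUS_PER_NODE = 4
GPU_MEMORY_GB = 80

total_gpus = WORKER_NODES * GPUS_PER_NODE
total_gpu_memory_gb = total_gpus * GPU_MEMORY_GB

print(f"Total GPUs: {total_gpus}")                    # 64
print(f"Total GPU memory: {total_gpu_memory_gb} GB")  # 5120 GB
```

These totals (64 GPUs, 5,120 GB of aggregate GPU memory) are useful when sizing large training jobs against the cluster's capacity.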

Storage Nodes

Storage Nodes provide persistent storage for datasets and model checkpoints. We currently use 4 Storage Nodes.

Specification       Value
CPU                 Intel Xeon Silver 4310
RAM                 128 GB DDR4 ECC Registered
Storage             16 x 18 TB SAS HDDs (RAID 6)
Network Interface   Dual 25 GbE
Power Supply        Redundant 1200W Gold

The storage is exported to the cluster over NFS (Network File System). Consult the NFS Configuration Guide for more information. Data backup procedures are detailed in the Data Backup Policy.
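A rough estimate of usable capacity follows from the table above (a back-of-the-envelope sketch: RAID 6 reserves the equivalent of two drives per array for parity, and filesystem overhead would reduce the real figure further):

```python
# Approximate usable capacity of the Storage Nodes, per the table above:
# 4 nodes, each with 16 x 18 TB SAS drives in RAID 6.
STORAGE_NODES = 4
DRIVES_PER_NODE = 16
DRIVE_TB = 18
RAID6_PARITY_DRIVES = 2  # RAID 6 tolerates two drive failures per array

usable_per_node_tb = (DRIVES_PER_NODE - RAID6_PARITY_DRIVES) * DRIVE_TB
cluster_usable_tb = STORAGE_NODES * usable_per_node_tb

print(f"Usable per node: {usable_per_node_tb} TB")  # 252 TB
print(f"Cluster usable:  {cluster_usable_tb} TB")   # 1008 TB
```

So the cluster offers on the order of 1 PB of raw usable space before filesystem formatting and reserved blocks are accounted for.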

Software Stack

The "AI in Dover" cluster runs a layered software stack to support AI development and deployment: Kubernetes for orchestration and resource management, Docker for containerizing workloads, the NVIDIA GPU drivers described in the GPU Driver Installation Guide, and NFS for shared persistent storage. The sections below and the linked guides cover the configuration of each layer.

Networking Configuration

The cluster utilizes a dedicated VLAN for internal communication. All nodes are connected to a high-speed network switch.

  • Network Topology: Flat network with a single subnet.
  • IP Address Range: 192.168.10.0/24
  • DNS: Internal DNS server managed by the Network Team.
  • Firewall: iptables is used to secure the cluster. Refer to Firewall Rules Documentation for details.
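The documented /24 can be checked against the current node count (2 master + 16 worker + 4 storage = 22 hosts) with a short sanity check using Python's standard `ipaddress` module; this is an illustrative sketch, not part of the Network Team's tooling:

```python
import ipaddress

# Verify that the documented subnet covers the current node count
# with room for expansion.
subnet = ipaddress.ip_network("192.168.10.0/24")
usable_hosts = subnet.num_addresses - 2  # minus network and broadcast addresses

nodes = 2 + 16 + 4  # masters + workers + storage
print(f"Usable host addresses: {usable_hosts}")  # 254
print(f"Current nodes: {nodes}, headroom: {usable_hosts - nodes}")
```

With 254 usable addresses against 22 nodes, the subnet leaves ample headroom for the planned Worker Node expansion before renumbering would be needed.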

Security Considerations

Security is paramount for the "AI in Dover" cluster. We adhere to the Data Security Policy.

  • Access Control: Access to the cluster is restricted to authorized personnel via SSH key authentication.
  • Data Encryption: Data is encrypted at rest and in transit.
  • Regular Security Audits: The cluster undergoes regular security audits conducted by the Security Team.

Future Expansion

Plans are underway to expand the cluster with additional Worker Nodes equipped with the latest generation of GPUs. This will significantly increase the cluster's computational capacity and enable us to tackle even more complex AI challenges. See Future Hardware Roadmap for planned upgrades.


Intel-Based Server Configurations

Configuration                   Specifications                                   Benchmark
Core i7-6700K/7700 Server       64 GB DDR4, 2 x 512 GB NVMe SSD                  CPU Benchmark: 8046
Core i7-8700 Server             64 GB DDR4, 2 x 1 TB NVMe SSD                    CPU Benchmark: 13124
Core i9-9900K Server            128 GB DDR4, 2 x 1 TB NVMe SSD                   CPU Benchmark: 49969
Core i9-13900 Server (64GB)     64 GB RAM, 2 x 2 TB NVMe SSD
Core i9-13900 Server (128GB)    128 GB RAM, 2 x 2 TB NVMe SSD
Core i5-13500 Server (64GB)     64 GB RAM, 2 x 500 GB NVMe SSD
Core i5-13500 Server (128GB)    128 GB RAM, 2 x 500 GB NVMe SSD
Core i5-13500 Workstation       64 GB DDR5 RAM, 2 x NVMe SSD, NVIDIA RTX 4000

AMD-Based Server Configurations

Configuration                    Specifications                       Benchmark
Ryzen 5 3600 Server              64 GB RAM, 2 x 480 GB NVMe           CPU Benchmark: 17849
Ryzen 7 7700 Server              64 GB DDR5 RAM, 2 x 1 TB NVMe        CPU Benchmark: 35224
Ryzen 9 5950X Server             128 GB RAM, 2 x 4 TB NVMe            CPU Benchmark: 46045
Ryzen 9 7950X Server             128 GB DDR5 ECC, 2 x 2 TB NVMe       CPU Benchmark: 63561
EPYC 7502P Server (128GB/1TB)    128 GB RAM, 1 TB NVMe                CPU Benchmark: 48021
EPYC 7502P Server (128GB/2TB)    128 GB RAM, 2 TB NVMe                CPU Benchmark: 48021
EPYC 7502P Server (128GB/4TB)    128 GB RAM, 2 x 2 TB NVMe            CPU Benchmark: 48021
EPYC 7502P Server (256GB/1TB)    256 GB RAM, 1 TB NVMe                CPU Benchmark: 48021
EPYC 7502P Server (256GB/4TB)    256 GB RAM, 2 x 2 TB NVMe            CPU Benchmark: 48021
EPYC 9454P Server                256 GB RAM, 2 x 2 TB NVMe


Note: All benchmark scores are approximate and may vary based on configuration. Server availability is subject to stock.