AI in Ewell: Server Configuration Documentation
Welcome to the documentation for the "AI in Ewell" server cluster. This document details the hardware and software configuration for this system, intended for new administrators and those seeking to understand the infrastructure supporting our artificial intelligence initiatives. This cluster is dedicated to running large language models and machine learning workloads, and is critical to the ongoing research conducted in Ewell.
Overview
The “AI in Ewell” server cluster comprises a distributed network of high-performance servers designed for parallel processing. The primary goal of this configuration is to provide the computational power needed for training and deploying complex AI models. The system is housed in a dedicated, climate-controlled server room within the Ewell facility. Redundancy and scalability are key design principles, ensuring high availability and the ability to adapt to future demands. Regular system backups are also performed.
Hardware Configuration
The cluster consists of four primary node types: Master Nodes, Worker Nodes, Storage Nodes, and Network Nodes. Each node type is specifically configured to optimize its respective function. Detailed specifications for the first three are given below; the networking hardware is described in the Networking section.
Master Nodes
Master nodes are responsible for job scheduling, resource management, and overall cluster coordination. They are equipped with powerful processors and ample RAM to handle the overhead of these tasks.
Component | Specification |
---|---|
CPU | 2 x Intel Xeon Gold 6338 (32 cores/64 threads per CPU) |
RAM | 256 GB DDR4 ECC Registered 3200MHz |
Storage (OS) | 1 TB NVMe SSD |
Network Interface | 2 x 100GbE Network Adapters |
Power Supply | 2 x 1600W Redundant Power Supplies |
Worker Nodes
Worker nodes perform the actual computational work of training and running AI models. They are equipped with high-end GPUs and a large amount of RAM; a sketch of how the GPUs on a single node are typically driven follows the table below.
Component | Specification |
---|---|
CPU | 2 x AMD EPYC 7763 (64 cores/128 threads per CPU) |
RAM | 512 GB DDR4 ECC Registered 3200MHz |
GPU | 8 x NVIDIA A100 (80GB HBM2e) |
Storage (Local) | 2 x 4 TB NVMe SSD (RAID 0) |
Network Interface | 2 x 100GbE Network Adapters |
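To make the GPU layout above concrete, the following is a minimal sketch of single-node, multi-GPU training on one worker node, assuming PyTorch with DistributedDataParallel launched via torchrun. The model, dataset, hyperparameters, and script name (train_sketch.py) are placeholders for illustration, not the cluster's actual workload.

```python
# Minimal sketch of single-node, multi-GPU training on one worker node.
# Assumed launch: torchrun --standalone --nproc_per_node=8 train_sketch.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # NCCL is the usual backend for NVIDIA GPUs

    model = nn.Linear(1024, 10).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])    # synchronizes gradients across the 8 GPUs
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):                        # placeholder training loop
        inputs = torch.randn(32, 1024, device=local_rank)
        targets = torch.randint(0, 10, (32,), device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()                            # DDP all-reduces gradients here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```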
Storage Nodes
Storage nodes provide the persistent storage for datasets, models, and other important data. They utilize a distributed file system for high availability and scalability. See Storage Architecture for details.
Component | Specification |
---|---|
CPU | 2 x Intel Xeon Silver 4310 (12 cores/24 threads per CPU) |
RAM | 128 GB DDR4 ECC Registered 3200MHz |
Storage | 64 x 16 TB SAS HDD (RAID 6) – approximately 1 PB usable capacity |
Network Interface | 2 x 40GbE Network Adapters |
Software Configuration
The cluster runs a customized Linux distribution based on Ubuntu Server 22.04. The following software components are essential to the operation of the AI in Ewell cluster.
- Operating System: Ubuntu Server 22.04 LTS
- Cluster Management: Slurm Workload Manager (a sample job-submission script is sketched after this list)
- Containerization: Docker and Kubernetes
- Machine Learning Frameworks: TensorFlow, PyTorch, scikit-learn
- Programming Languages: Python, C++
- File System: GlusterFS
- Monitoring: Prometheus and Grafana
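As an illustration of how these components fit together, below is a sketch of a Slurm batch script for a single-node, 8-GPU training job on a worker node. The partition name (gpu), resource limits, and entry-point script (train_sketch.py) are assumptions for illustration, not the cluster's actual queue layout; Slurm accepts a Python script body because the #SBATCH directives are ordinary comments.

```python
#!/usr/bin/env python3
#SBATCH --job-name=llm-train-sketch
#SBATCH --partition=gpu            # hypothetical partition name for the worker nodes
#SBATCH --nodes=1
#SBATCH --gres=gpu:8               # request all 8 A100s on one worker node
#SBATCH --cpus-per-task=16
#SBATCH --time=24:00:00
#
# Slurm treats the #SBATCH lines above as directives even though the body is Python;
# submit with `sbatch <this-file>`. The body simply starts one training process per
# GPU via torchrun (train_sketch.py is a placeholder entry point).
import subprocess

subprocess.run(
    ["torchrun", "--standalone", "--nproc_per_node=8", "train_sketch.py"],
    check=True,
)
```

Once submitted, Slurm schedules the job on a free worker node and torchrun starts one process per GPU, as in the worker-node sketch above.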
Networking
The network infrastructure is a critical component of the AI in Ewell cluster. A dedicated 100GbE network connects all nodes, providing high-bandwidth, low-latency communication, while a separate 10GbE network carries management and monitoring traffic. Virtual LANs (VLANs) are used to segment traffic. Detailed network diagrams are available in the Network Topology document.
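When a distributed job spans multiple worker nodes, its collective-communication traffic should ride on the 100GbE data network rather than the 10GbE management network. One way to do this with NCCL (and Gloo, if used) is to pin the socket interface via environment variables before the process group is created, as in the sketch below; the interface names shown are assumptions for illustration only.

```python
# Minimal sketch: steer collective-communication traffic onto the 100GbE data network.
# The interface names (ens100f0 for data, eno1 for management) are assumed for
# illustration; check the real names with `ip link` on each node. This snippet is
# intended to sit at the top of a training script launched under torchrun or srun.
import os

import torch.distributed as dist

os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens100f0")   # NCCL bootstrap/TCP traffic
os.environ.setdefault("GLOO_SOCKET_IFNAME", "ens100f0")   # Gloo fallback, if used

# Must be set before the process group is created.
dist.init_process_group(backend="nccl")
```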
Security Considerations
Security is paramount. The cluster is protected by a firewall, intrusion detection system, and regular security audits. Access to the cluster is restricted to authorized personnel only. All data is encrypted both in transit and at rest. Refer to the Security Protocol for detailed information.
Future Expansion
Plans are underway to expand the cluster with additional worker nodes and storage capacity. This expansion will involve upgrading the network infrastructure to 200GbE and implementing a more advanced cooling system. A detailed roadmap for future expansion is documented in the Expansion Plan. The addition of NVLink interconnects is also being considered.
Troubleshooting
Common issues and their solutions are documented in the Troubleshooting Guide. Please consult this guide before contacting support.
Intel-Based Server Configurations
Configuration | Specifications | CPU Benchmark |
---|---|---|
Core i7-6700K/7700 Server | 64 GB DDR4, 2 x 512 GB NVMe SSD | 8046 |
Core i7-8700 Server | 64 GB DDR4, 2 x 1 TB NVMe SSD | 13124 |
Core i9-9900K Server | 128 GB DDR4, 2 x 1 TB NVMe SSD | 49969 |
Core i9-13900 Server (64GB) | 64 GB RAM, 2 x 2 TB NVMe SSD | |
Core i9-13900 Server (128GB) | 128 GB RAM, 2 x 2 TB NVMe SSD | |
Core i5-13500 Server (64GB) | 64 GB RAM, 2 x 500 GB NVMe SSD | |
Core i5-13500 Server (128GB) | 128 GB RAM, 2 x 500 GB NVMe SSD | |
Core i5-13500 Workstation | 64 GB DDR5 RAM, 2 x NVMe SSD, NVIDIA RTX 4000 | |
AMD-Based Server Configurations
Configuration | Specifications | CPU Benchmark |
---|---|---|
Ryzen 5 3600 Server | 64 GB RAM, 2 x 480 GB NVMe SSD | 17849 |
Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2 x 1 TB NVMe SSD | 35224 |
Ryzen 9 5950X Server | 128 GB RAM, 2 x 4 TB NVMe SSD | 46045 |
Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2 x 2 TB NVMe SSD | 63561 |
EPYC 7502P Server (128GB/1TB) | 128 GB RAM, 1 TB NVMe SSD | 48021 |
EPYC 7502P Server (128GB/2TB) | 128 GB RAM, 2 TB NVMe SSD | 48021 |
EPYC 7502P Server (128GB/4TB) | 128 GB RAM, 2 x 2 TB NVMe SSD | 48021 |
EPYC 7502P Server (256GB/1TB) | 256 GB RAM, 1 TB NVMe SSD | 48021 |
EPYC 7502P Server (256GB/4TB) | 256 GB RAM, 2 x 2 TB NVMe SSD | 48021 |
EPYC 9454P Server | 256 GB RAM, 2 x 2 TB NVMe SSD | |
*Note: All benchmark scores are approximate and may vary based on configuration.*