AI in Chatham: Server Configuration Documentation
Welcome to the documentation for the "AI in Chatham" server cluster! This article details the hardware and software configuration of our dedicated AI processing environment and is aimed at new system administrators and developers who will work with this infrastructure. Please read it carefully.
Overview
The "AI in Chatham" project utilizes a dedicated cluster of servers located in the Chatham data center to facilitate machine learning model training, inference, and data processing. This cluster is designed for scalability and high performance, with a focus on GPU acceleration. The environment supports multiple frameworks, including TensorFlow, PyTorch, and scikit-learn. Access to the cluster is managed through a centralized authentication system and job scheduling is handled by Slurm. This document details the core components and configurations.
Hardware Specifications
The cluster comprises several node types, each optimized for specific tasks. Below are detailed specifications for each node type.
| Node Type | CPU | Memory (RAM) | Storage | GPU |
|---|---|---|---|---|
| Master Node | 2 x Intel Xeon Gold 6248R (24 cores/CPU) | 256 GB DDR4 ECC | 1 x 1 TB NVMe SSD (OS) + 8 x 16 TB SAS HDD (Data) | None |
| Compute Node (GPU-Heavy) | 2 x Intel Xeon Gold 6338 (32 cores/CPU) | 512 GB DDR4 ECC | 1 x 1 TB NVMe SSD (OS) + 2 x 8 TB SAS HDD (Data) | 4 x NVIDIA A100 (80 GB) |
| Compute Node (Memory-Heavy) | 2 x Intel Xeon Gold 6338 (32 cores/CPU) | 1 TB DDR4 ECC | 1 x 1 TB NVMe SSD (OS) + 2 x 8 TB SAS HDD (Data) | 2 x NVIDIA A100 (40 GB) |
| Storage Node | 2 x Intel Xeon Silver 4310 (12 cores/CPU) | 128 GB DDR4 ECC | 16 x 16 TB SAS HDD (RAID 6) | None |
The network infrastructure utilizes a 100 Gbps InfiniBand interconnect for high-speed communication between nodes. Detailed network diagrams can be found on the network documentation page.
Software Configuration
All nodes run Ubuntu Server 22.04 LTS. Software deployment and configuration management are handled centrally with Ansible to keep nodes consistent, and all nodes are monitored using Prometheus and Grafana.
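Cluster metrics can also be queried programmatically through the standard Prometheus HTTP API. The sketch below is illustrative only: the Prometheus address and the `node` job label are assumptions, so substitute the values published on the monitoring documentation page.

```python
# Illustrative node-health check via the Prometheus HTTP API.
# The server address and job label below are assumptions, not the
# cluster's actual values.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.chatham.internal:9090"  # hypothetical address
query = urllib.parse.quote('up{job="node"}')            # hypothetical job label

with urllib.request.urlopen(f"{PROMETHEUS}/api/v1/query?query={query}") as resp:
    data = json.load(resp)

for result in data["data"]["result"]:
    state = "up" if result["value"][1] == "1" else "down"
    print(result["metric"]["instance"], state)
```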
Core Software Components
| Component | Version | Purpose |
|---|---|---|
| Operating System | Ubuntu Server 22.04 LTS | Base operating system for all nodes. |
| CUDA Toolkit | 12.2 | NVIDIA's parallel computing platform and API. |
| cuDNN | 8.9.2 | NVIDIA's Deep Neural Network library. |
| NCCL | 2.14.3 | NVIDIA Collective Communications Library for multi-GPU communication. |
| Python | 3.10 | Primary programming language for AI/ML tasks. |
| Slurm Workload Manager | 23.11.0 | Job scheduling and resource management. |
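The GPU-stack versions above can be cross-checked from inside a Python session. This is a rough sketch assuming a CUDA-enabled PyTorch build; the versions PyTorch reports are those it was built against, which should match (or be compatible with) the system-wide installs listed in the table.

```python
# Cross-check the CUDA/cuDNN/NCCL versions visible to PyTorch against
# the table above. Reported versions depend on the installed PyTorch
# build, so treat mismatches as a prompt to ask the administrators.
import torch

print(f"CUDA:  {torch.version.cuda}")              # expected: 12.2
print(f"cuDNN: {torch.backends.cudnn.version()}")  # expected: 8902, i.e. 8.9.2
print(f"NCCL:  {torch.cuda.nccl.version()}")       # expected: (2, 14, 3)
```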
Networking
Each node has a static IP address within the 10.0.0.0/16 network. The Master Node acts as the primary gateway and DNS server. Firewall configuration is managed centrally and permits only the ports required for cluster operation. Outbound access to external networks is limited and requires explicit approval.
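When debugging connectivity or firewall rules, it can help to verify that an address actually falls inside the cluster network. Below is a minimal sketch using Python's standard library; the host addresses are illustrative examples, not real node assignments.

```python
# Check whether addresses fall inside the 10.0.0.0/16 cluster network.
# The host addresses below are illustrative examples only.
import ipaddress

CLUSTER_NET = ipaddress.ip_network("10.0.0.0/16")

for host in ("10.0.1.10", "10.0.2.21", "192.168.1.5"):
    inside = ipaddress.ip_address(host) in CLUSTER_NET
    print(f"{host}: {'inside' if inside else 'OUTSIDE'} cluster network")
```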
Storage
A shared Lustre filesystem is mounted on the master node and all compute nodes, allowing easy data sharing and collaboration. The Storage Nodes provide the backend storage for the Lustre filesystem. Data is backed up nightly to a separate offsite location, as detailed in the backup policy.
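Before staging large datasets, check the remaining capacity on the shared filesystem. A minimal sketch follows, assuming the Lustre filesystem is mounted at `/lustre`; the actual mount point may differ, so verify it first (for example with `mount -t lustre`).

```python
# Report capacity of the shared Lustre filesystem before staging data.
# The mount point /lustre is an assumption; verify the real path first.
import shutil

usage = shutil.disk_usage("/lustre")  # hypothetical mount point
print(f"Total: {usage.total / 1e12:.1f} TB")
print(f"Used:  {usage.used / 1e12:.1f} TB")
print(f"Free:  {usage.free / 1e12:.1f} TB")
```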
Access and Usage
Access to the "AI in Chatham" cluster is granted through a request process documented on the access request page. Once access is granted, users can submit jobs using the `sbatch` command via Slurm. Detailed instructions on using Slurm can be found on the Slurm documentation page.
It is imperative that all users adhere to the resource usage policy to ensure fair access for all.
Security Considerations
The cluster is protected by a multi-layered security approach, including firewalls, intrusion detection systems, and regular security audits. All user accounts are required to use multi-factor authentication. Please report any security vulnerabilities immediately to the security team.
Future Development
Planned upgrades include expanding the GPU capacity with the latest generation of NVIDIA GPUs and implementing a more robust monitoring system. We are also exploring the integration of Kubernetes for containerized workloads.