AI in Chatham


AI in Chatham: Server Configuration Documentation

Welcome to the documentation for the "AI in Chatham" server cluster! This article details the hardware and software configuration of our dedicated AI processing environment. This guide is aimed at new system administrators and developers who will be interacting with this infrastructure. Please read carefully.

Overview

The "AI in Chatham" project utilizes a dedicated cluster of servers located in the Chatham data center to facilitate machine learning model training, inference, and data processing. This cluster is designed for scalability and high performance, with a focus on GPU acceleration. The environment supports multiple frameworks, including TensorFlow, PyTorch, and scikit-learn. Access to the cluster is managed through a centralized authentication system and job scheduling is handled by Slurm. This document details the core components and configurations.

Hardware Specifications

The cluster comprises several node types, each optimized for specific tasks. Below are detailed specifications for each node type.

| Node Type | CPU | Memory (RAM) | Storage | GPU |
|---|---|---|---|---|
| Master Node | 2 x Intel Xeon Gold 6248R (24 cores/CPU) | 256 GB DDR4 ECC | 1 x 1 TB NVMe SSD (OS) + 8 x 16 TB SAS HDD (Data) | None |
| Compute Node (GPU-Heavy) | 2 x Intel Xeon Gold 6338 (32 cores/CPU) | 512 GB DDR4 ECC | 1 x 1 TB NVMe SSD (OS) + 2 x 8 TB SAS HDD (Data) | 4 x NVIDIA A100 (80 GB) |
| Compute Node (Memory-Heavy) | 2 x Intel Xeon Gold 6338 (32 cores/CPU) | 1 TB DDR4 ECC | 1 x 1 TB NVMe SSD (OS) + 2 x 8 TB SAS HDD (Data) | 2 x NVIDIA A100 (40 GB) |
| Storage Node | 2 x Intel Xeon Silver 4310 (12 cores/CPU) | 128 GB DDR4 ECC | 16 x 16 TB SAS HDD (RAID 6) | None |

The network infrastructure utilizes a 100 Gbps InfiniBand interconnect for high-speed communication between nodes. Detailed network diagrams can be found on the network documentation page.
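
For illustration, here is a minimal multi-GPU communication sketch using PyTorch's NCCL backend, which exercises the GPU fabric (and, across nodes, the InfiniBand interconnect). It assumes a launch via `torchrun`; the script name and launch parameters are hypothetical examples, not cluster policy.

```python
# all_reduce_check.py - minimal multi-GPU all-reduce sketch.
# Example launch (single GPU-heavy node): torchrun --nproc_per_node=4 all_reduce_check.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its rank number; a correct sum confirms that
    # all participating GPUs can communicate over NCCL.
    t = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all-reduce sum = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```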

Software Configuration

The operating system installed on all nodes is Ubuntu Server 22.04 LTS. The cluster utilizes a centralized software deployment system based on Ansible for consistent configuration management. All nodes are monitored using Prometheus and Grafana.
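
Since monitoring is exposed through Prometheus, metrics can also be queried programmatically via its standard HTTP API. The sketch below is a hedged example; the server URL is hypothetical, so substitute the address from the monitoring documentation.

```python
# Query the cluster's Prometheus server via its HTTP API.
# The URL below is hypothetical; requires the `requests` package.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical address

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": "up"},  # 'up' reports which scrape targets are reachable
    timeout=10,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("instance"), "->", result["value"][1])
```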

Core Software Components

| Component | Version | Purpose |
|---|---|---|
| Operating System | Ubuntu Server 22.04 LTS | Base operating system for all nodes. |
| CUDA Toolkit | 12.2 | NVIDIA's parallel computing platform and API. |
| cuDNN | 8.9.2 | NVIDIA's Deep Neural Network library. |
| NCCL | 2.14.3 | NVIDIA Collective Communications Library for multi-GPU communication. |
| Python | 3.10 | Primary programming language for AI/ML tasks. |
| Slurm Workload Manager | 23.11.0 | Job scheduling and resource management. |
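
To confirm that a job sees the versions listed above, they can be queried through PyTorch on a compute node. This is a minimal sketch; the exact output formats vary slightly between PyTorch releases.

```python
# Report the CUDA, cuDNN, and NCCL versions visible to PyTorch,
# for comparison against the component table above.
import torch

print(f"CUDA (as built into PyTorch): {torch.version.cuda}")  # e.g. '12.2'
print(f"cuDNN: {torch.backends.cudnn.version()}")             # e.g. 8902
print(f"NCCL:  {torch.cuda.nccl.version()}")                  # e.g. (2, 14, 3)
if torch.cuda.is_available():
    print(f"GPUs visible: {torch.cuda.device_count()}")
    print(f"Device 0: {torch.cuda.get_device_name(0)}")
```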

Networking

Each node has a static IP address assigned within the 10.0.0.0/16 network. The Master Node acts as the primary gateway and DNS server. The firewall configuration is managed centrally and allows only necessary ports for cluster operation. Access to the outside world is limited and requires explicit approval.
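
As a small illustration of the addressing scheme, the sketch below checks whether the current host's resolved address falls inside the cluster network. It is an example only; the lookup relies on the Master Node's DNS service.

```python
# Check whether this host's address sits inside the cluster's 10.0.0.0/16
# network. Note: on some hosts gethostbyname() resolves to a loopback
# entry; in that case use the node's interface address instead.
import ipaddress
import socket

CLUSTER_NET = ipaddress.ip_network("10.0.0.0/16")

addr = ipaddress.ip_address(socket.gethostbyname(socket.gethostname()))
print(f"{addr} in {CLUSTER_NET}: {addr in CLUSTER_NET}")
```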

Storage

A shared Lustre filesystem is mounted on the master and all compute nodes, enabling easy data sharing and collaboration. The Storage Nodes provide the backend storage for the Lustre filesystem. Data backups are performed nightly to a separate offsite location, as detailed in the backup policy.
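
Before staging large datasets, it can be useful to check free space on the shared filesystem. The mount point in the sketch below is hypothetical; check the storage documentation for the actual path.

```python
# Report capacity on the shared Lustre filesystem.
import shutil

LUSTRE_MOUNT = "/lustre"  # hypothetical mount point; see storage docs

usage = shutil.disk_usage(LUSTRE_MOUNT)
print(f"total: {usage.total / 1e12:.1f} TB")
print(f"used:  {usage.used / 1e12:.1f} TB")
print(f"free:  {usage.free / 1e12:.1f} TB")
```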

Access and Usage

Access to the "AI in Chatham" cluster is granted through a request process documented on the access request page. Once access is granted, users can submit jobs using the `sbatch` command via Slurm. Detailed instructions on using Slurm can be found on the Slurm documentation page.

It is imperative that all users adhere to the resource usage policy to ensure fair access for all.

Security Considerations

The cluster is protected by a multi-layered security approach, including firewalls, intrusion detection systems, and regular security audits. All user accounts are required to use multi-factor authentication. Please report any security vulnerabilities immediately to the security team.

Future Development

Planned upgrades include expanding the GPU capacity with the latest generation of NVIDIA GPUs and implementing a more robust monitoring system. We are also exploring the integration of Kubernetes for containerized workloads.


Intel-Based Server Configurations

| Configuration | Specifications | CPU Benchmark |
|---|---|---|
| Core i7-6700K/7700 Server | 64 GB DDR4, 2 x 512 GB NVMe SSD | 8046 |
| Core i7-8700 Server | 64 GB DDR4, 2 x 1 TB NVMe SSD | 13124 |
| Core i9-9900K Server | 128 GB DDR4, 2 x 1 TB NVMe SSD | 49969 |
| Core i9-13900 Server (64 GB) | 64 GB RAM, 2 x 2 TB NVMe SSD | – |
| Core i9-13900 Server (128 GB) | 128 GB RAM, 2 x 2 TB NVMe SSD | – |
| Core i5-13500 Server (64 GB) | 64 GB RAM, 2 x 500 GB NVMe SSD | – |
| Core i5-13500 Server (128 GB) | 128 GB RAM, 2 x 500 GB NVMe SSD | – |
| Core i5-13500 Workstation | 64 GB DDR5 RAM, 2 x NVMe SSD, NVIDIA RTX 4000 | – |

AMD-Based Server Configurations

| Configuration | Specifications | CPU Benchmark |
|---|---|---|
| Ryzen 5 3600 Server | 64 GB RAM, 2 x 480 GB NVMe | 17849 |
| Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2 x 1 TB NVMe | 35224 |
| Ryzen 9 5950X Server | 128 GB RAM, 2 x 4 TB NVMe | 46045 |
| Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2 x 2 TB NVMe | 63561 |
| EPYC 7502P Server (128 GB/1 TB) | 128 GB RAM, 1 TB NVMe | 48021 |
| EPYC 7502P Server (128 GB/2 TB) | 128 GB RAM, 2 TB NVMe | 48021 |
| EPYC 7502P Server (128 GB/4 TB) | 128 GB RAM, 2 x 2 TB NVMe | 48021 |
| EPYC 7502P Server (256 GB/1 TB) | 256 GB RAM, 1 TB NVMe | 48021 |
| EPYC 7502P Server (256 GB/4 TB) | 256 GB RAM, 2 x 2 TB NVMe | 48021 |
| EPYC 9454P Server | 256 GB RAM, 2 x 2 TB NVMe | – |


Note: All benchmark scores are approximate and may vary based on configuration. Server availability is subject to stock.