AI in Sevenoaks: Server Configuration

This document details the server configuration for the "AI in Sevenoaks" project, a dedicated cluster for artificial intelligence and machine learning workloads hosted in the Sevenoaks data center. It is intended for newcomers to the system and gives an overview of the hardware and software components. For general data center information, see Sevenoaks Data Center Overview.

Overview

The "AI in Sevenoaks" cluster is designed for high-performance computing, specifically for training and deploying large AI models. The system prioritizes GPU acceleration, high-speed networking, and substantial storage capacity. Review Network Topology before attempting any modifications to the system. This cluster is separate from the General Purpose Compute Cluster.

Hardware Components

The cluster consists of several key hardware components, detailed below. All hardware is under warranty until 2025, as documented in Hardware Warranty Information.

Compute Nodes

The compute nodes run the training and deployment workloads. Each node is configured as follows.

| Specification | Value |
|---|---|
| Manufacturer | Supermicro |
| Model | SYS-220M-360 |
| CPU | 2 x Intel Xeon Gold 6338 |
| CPU Cores | 32 per CPU (64 per node) |
| RAM | 256 GB DDR4 ECC REG |
| GPU | 4 x NVIDIA A100 80GB |
| Storage (Local) | 1 TB NVMe SSD (OS & Temp) |
| Network Interface | 2 x 200 Gbps InfiniBand |

These nodes are interconnected using a non-blocking InfiniBand network, vital for distributed training. See InfiniBand Configuration for details. Regular hardware health checks are performed as per Server Maintenance Schedule.
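The per-node figures in the table can be combined into a few aggregate numbers that are useful when sizing a job. The following is a minimal sketch, not part of the cluster tooling: the class and property names are hypothetical, the values are copied from the table above, and the InfiniBand figure is a nominal line rate rather than a measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeNodeSpec:
    """Per-node figures from the compute node table (names are illustrative)."""

    ram_gb: int = 256          # host RAM per node
    gpus: int = 4              # NVIDIA A100 cards per node
    gpu_memory_gb: int = 80    # memory per A100
    ib_ports: int = 2          # InfiniBand ports per node
    ib_port_gbps: int = 200    # nominal rate per port, in gigabits per second

    @property
    def total_gpu_memory_gb(self) -> int:
        # Aggregate GPU memory available on one node.
        return self.gpus * self.gpu_memory_gb

    @property
    def host_ram_per_gpu_gb(self) -> float:
        # Host RAM available per GPU if shared evenly.
        return self.ram_gb / self.gpus

    @property
    def ib_line_rate_gbps(self) -> int:
        # Combined nominal InfiniBand rate across both ports.
        return self.ib_ports * self.ib_port_gbps

    @property
    def ib_line_rate_gbytes_per_s(self) -> float:
        # Same rate in gigabytes per second (8 bits per byte).
        return self.ib_line_rate_gbps / 8


node = ComputeNodeSpec()
print(node.total_gpu_memory_gb)        # 320
print(node.host_ram_per_gpu_gb)        # 64.0
print(node.ib_line_rate_gbps)          # 400
print(node.ib_line_rate_gbytes_per_s)  # 50.0
```

Note that the 256 GB of host RAM is smaller than the 320 GB of aggregate GPU memory, which may matter for jobs that stage full model state in host memory.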

Storage Node

A dedicated storage node provides centralized storage for datasets and model checkpoints.

| Specification | Value |
|---|---|
| Manufacturer | Dell |
| Model | PowerEdge R750xa |
| CPU | 2 x Intel Xeon Platinum 8380 |
| CPU Cores | 40 per CPU (80 total) |
| RAM | 512 GB DDR4 ECC REG |
| Storage (Total) | 1 PB NVMe SSD (RAID 10) |
| Filesystem | Lustre |
| Network Interface | 4 x 100 Gbps Ethernet |

The Lustre filesystem provides high throughput and scalability, crucial for handling large datasets. Refer to Lustre Filesystem Documentation for more information. The storage node is backed up nightly, as described in Backup and Recovery Procedures.
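As a rough guide to what the storage node's network interfaces allow, the sketch below estimates a lower bound on the time needed to move a dataset off the node. It assumes the four 100 Gbps Ethernet ports from the table are all usable in parallel and uses decimal units (1 TB = 1000 GB). The function names and the `efficiency` parameter are hypothetical, and real Lustre throughput will depend on striping, client count, and protocol overhead.

```python
def line_rate_gbytes_per_s(ports: int = 4, gbps_per_port: float = 100.0) -> float:
    """Nominal combined line rate in gigabytes per second (8 bits per byte)."""
    return ports * gbps_per_port / 8


def min_transfer_seconds(dataset_tb: float, efficiency: float = 1.0) -> float:
    """Lower bound on transfer time for a dataset of `dataset_tb` terabytes.

    `efficiency` is an assumed fraction of line rate actually achieved
    (1.0 means the full nominal rate, which is not reachable in practice).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    dataset_gb = dataset_tb * 1000  # decimal terabytes to gigabytes
    return dataset_gb / (line_rate_gbytes_per_s() * efficiency)


print(line_rate_gbytes_per_s())         # 50.0
print(min_transfer_seconds(10))         # 200.0
print(min_transfer_seconds(10, 0.5))    # 400.0
```

Under these assumptions, a 10 TB dataset needs at least 200 seconds at full line rate, so the figure is best read as a floor for planning rather than an expected value.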

Network Infrastructure

The cluster combines an InfiniBand interconnect with a separate Ethernet network, as summarized below.

| Component | Specification |
|---|---|
| Interconnect | Mellanox InfiniBand HDR |
| Switches | 2 x Mellanox Spectrum-2 |
| Switch Capacity | 800 Gbps |
| Ethernet Network | 100 Gbps |
| Firewall | Fortinet FortiGate 600F |

The InfiniBand network is isolated from the external network for security reasons. See Firewall Configuration for details on network access control.

Software Stack

The software stack is built upon a Linux foundation, optimized for AI/ML workloads.
