AI in Sevenoaks

AI in Sevenoaks: Server Configuration

This document details the server configuration for the "AI in Sevenoaks" project, a dedicated cluster for Artificial Intelligence and Machine Learning workloads hosted in the Sevenoaks data center. It is written for newcomers to the system and gives an overview of the hardware and software components. For general data center information, see Sevenoaks Data Center Overview.

Overview

The "AI in Sevenoaks" cluster is designed for high-performance computing, specifically tailored to handle the demands of training and deploying large AI models. The system prioritizes GPU acceleration, high-speed networking, and substantial storage capacity. It's critical to understand the Network Topology before attempting any modifications to the system. This cluster is distinct from the General Purpose Compute Cluster.

Hardware Components

The cluster consists of several key hardware components, detailed below. All hardware is under warranty until 2025, as documented in Hardware Warranty Information.

Compute Nodes

The compute nodes run the cluster's training and deployment workloads. Each node has the following configuration.

Specification | Value
Manufacturer | Supermicro
Model | SYS-220M-360
CPU | 2 x Intel Xeon Gold 6338
CPU Cores per Node | 64 (32 per CPU)
RAM | 256 GB DDR4 ECC REG
GPU | 4 x NVIDIA A100 80GB
Storage (Local) | 1 TB NVMe SSD (OS & Temp)
Network Interface | 2 x 200 Gbps InfiniBand

These nodes are interconnected using a non-blocking InfiniBand network, vital for distributed training. See InfiniBand Configuration for details. Regular hardware health checks are performed as per Server Maintenance Schedule.
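
To give a feel for what the per-node figures above mean for distributed training, here is a minimal sketch that derives the GPU memory and InfiniBand bandwidth of one node and a bandwidth-only estimate for a ring all-reduce. The class and function names are hypothetical, the node count and payload size are illustrative (this page does not state how many compute nodes exist), and the estimate assumes both InfiniBand ports can be used in aggregate and ignores latency and protocol overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeNode:
    """Per-node figures taken from the compute node table."""
    gpus: int = 4
    gpu_mem_gb: int = 80
    ib_ports: int = 2
    ib_gbps_per_port: int = 200

    @property
    def total_gpu_mem_gb(self) -> int:
        return self.gpus * self.gpu_mem_gb

    @property
    def ib_gbytes_per_s(self) -> float:
        # Convert aggregate gigabits per second to gigabytes per second.
        return self.ib_ports * self.ib_gbps_per_port / 8


def ring_allreduce_seconds(payload_gb: float, nodes: int,
                           link_gbytes_per_s: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce.

    Each node sends and receives 2 * (N - 1) / N times the payload.
    """
    if nodes < 2:
        return 0.0
    return 2 * (nodes - 1) / nodes * payload_gb / link_gbytes_per_s


node = ComputeNode()
print(node.total_gpu_mem_gb)   # 320 GB of GPU memory per node
print(node.ib_gbytes_per_s)    # 50.0 GB/s aggregate InfiniBand per node

# Hypothetical example: 10 GB of gradients synchronised across 4 nodes.
print(ring_allreduce_seconds(10, nodes=4, link_gbytes_per_s=node.ib_gbytes_per_s))
```

Real all-reduce times will be higher than this bound; see InfiniBand Configuration for the actual fabric layout.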

Storage Node

A dedicated storage node provides centralized storage for datasets and model checkpoints.

Specification | Value
Manufacturer | Dell
Model | PowerEdge R750xa
CPU | 2 x Intel Xeon Platinum 8380
CPU Cores | 40 per CPU (80 total)
RAM | 512 GB DDR4 ECC REG
Storage (Total) | 1 PB NVMe SSD (RAID 10)
Filesystem | Lustre
Network Interface | 4 x 100 Gbps Ethernet

The Lustre filesystem provides high throughput and scalability, crucial for handling large datasets. Refer to Lustre Filesystem Documentation for more information. The storage node is backed up nightly, as described in Backup and Recovery Procedures.
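
The storage figures above can be turned into rough planning numbers. The sketch below is illustrative only: the function names are hypothetical, the table does not say whether the 1 PB figure is raw or usable capacity, and the transfer estimate assumes all four Ethernet links are aggregated and fully utilised unless an efficiency factor is supplied.

```python
def raid10_usable_tb(raw_tb: float) -> float:
    """RAID 10 mirrors every drive, so usable space is half of raw."""
    return raw_tb / 2


def transfer_seconds(size_tb: float, links: int, gbps_per_link: float,
                     efficiency: float = 1.0) -> float:
    """Seconds to move size_tb over aggregated links (decimal units)."""
    gbytes_per_s = links * gbps_per_link / 8 * efficiency
    return size_tb * 1000 / gbytes_per_s


# If the 1 PB (1000 TB) in the table is raw capacity, usable space is:
print(raid10_usable_tb(1000))                             # 500.0 TB

# Best-case time to read a 10 TB dataset over 4 x 100 Gbps Ethernet:
print(transfer_seconds(10, links=4, gbps_per_link=100))   # 200.0 s
```

Passing a value such as `efficiency=0.7` gives a more conservative estimate; the right value depends on the Lustre layout and client load.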

Network Infrastructure

The cluster runs two networks: an InfiniBand HDR fabric for traffic between compute nodes and a 100 Gbps Ethernet network.

Component | Specification
Interconnect | Mellanox InfiniBand HDR
Switches | 2 x Mellanox Spectrum-2
Switch Capacity | 800 Gbps
Ethernet Network | 100 Gbps
Firewall | Fortinet FortiGate 600F

The InfiniBand network is isolated from the external network for security reasons. See Firewall Configuration for details on network access control.

Software Stack

The software stack is built upon a Linux foundation, optimized for AI/ML workloads.

  • Operating System: Ubuntu 22.04 LTS, managed via Configuration Management System.
  • Containerization: Docker and Kubernetes are used for application deployment and orchestration. Refer to Kubernetes Deployment Guide.
  • Machine Learning Frameworks: TensorFlow, PyTorch, and JAX are pre-installed and optimized for the NVIDIA A100 GPUs. See ML Framework Versions for specific versions.
  • Job Scheduler: Slurm is used for managing and scheduling jobs across the cluster. Refer to Slurm Job Submission for instructions.
  • Monitoring: Prometheus and Grafana are used for system monitoring and alerting. Detailed dashboards can be found at Monitoring Dashboards.
  • Version Control: All code and configuration files are managed using Git and hosted on Internal Git Repository.
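
As an illustration of how a job might be prepared for the Slurm scheduler listed above, the following sketch renders a batch script that requests all four GPUs on each node. It is not the site's official template: the function name, job name, command, and default limits are hypothetical, and the GRES name `gpu` is an assumption that depends on how Slurm is configured on this cluster. Slurm Job Submission remains the authoritative reference.

```python
def render_sbatch(job_name: str, command: str, nodes: int = 1,
                  gpus_per_node: int = 4, cpus_per_task: int = 8,
                  time_limit: str = "04:00:00") -> str:
    """Return the text of a Slurm batch script, one task per GPU."""
    if not 1 <= gpus_per_node <= 4:
        raise ValueError("each compute node has 4 GPUs")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",   # assumed GRES name
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --time={time_limit}",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical two-node training job.
script = render_sbatch("train-demo", "python train.py", nodes=2)
print(script)
```

The resulting text can be written to a file and passed to `sbatch`. Partition and account options are omitted because this page does not name them.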

Security Considerations

Security is paramount. All access to the cluster is controlled through Access Control Procedures. Regular security audits are performed, and findings are documented in Security Audit Reports. Please familiarize yourself with the Data Security Policy.

Future Expansion

Plans are underway to expand the cluster with additional compute nodes and storage capacity in Q4 2024. See Future Expansion Plans for the latest updates.


Intel-Based Server Configurations

Configuration | Specifications | Benchmark
Core i7-6700K/7700 Server | 64 GB DDR4, NVMe SSD 2 x 512 GB | CPU Benchmark: 8046
Core i7-8700 Server | 64 GB DDR4, NVMe SSD 2 x 1 TB | CPU Benchmark: 13124
Core i9-9900K Server | 128 GB DDR4, NVMe SSD 2 x 1 TB | CPU Benchmark: 49969
Core i9-13900 Server (64GB) | 64 GB RAM, 2 x 2 TB NVMe SSD |
Core i9-13900 Server (128GB) | 128 GB RAM, 2 x 2 TB NVMe SSD |
Core i5-13500 Server (64GB) | 64 GB RAM, 2 x 500 GB NVMe SSD |
Core i5-13500 Server (128GB) | 128 GB RAM, 2 x 500 GB NVMe SSD |
Core i5-13500 Workstation | 64 GB DDR5 RAM, 2 NVMe SSD, NVIDIA RTX 4000 |

AMD-Based Server Configurations

Configuration | Specifications | Benchmark
Ryzen 5 3600 Server | 64 GB RAM, 2 x 480 GB NVMe | CPU Benchmark: 17849
Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2 x 1 TB NVMe | CPU Benchmark: 35224
Ryzen 9 5950X Server | 128 GB RAM, 2 x 4 TB NVMe | CPU Benchmark: 46045
Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2 x 2 TB NVMe | CPU Benchmark: 63561
EPYC 7502P Server (128GB/1TB) | 128 GB RAM, 1 TB NVMe | CPU Benchmark: 48021
EPYC 7502P Server (128GB/2TB) | 128 GB RAM, 2 TB NVMe | CPU Benchmark: 48021
EPYC 7502P Server (128GB/4TB) | 128 GB RAM, 2 x 2 TB NVMe | CPU Benchmark: 48021
EPYC 7502P Server (256GB/1TB) | 256 GB RAM, 1 TB NVMe | CPU Benchmark: 48021
EPYC 7502P Server (256GB/4TB) | 256 GB RAM, 2 x 2 TB NVMe | CPU Benchmark: 48021
EPYC 9454P Server | 256 GB RAM, 2 x 2 TB NVMe |

Note: All benchmark scores are approximate and may vary based on configuration. Server availability is subject to stock.