AI in Sevenoaks: Server Configuration
This document details the server configuration for the "AI in Sevenoaks" project, a dedicated cluster for Artificial Intelligence and Machine Learning workloads hosted within the Sevenoaks data center. This article is intended for newcomers to the system and aims to provide a comprehensive overview of the hardware and software components. Please refer to Sevenoaks Data Center Overview for general data center information.
Overview
The "AI in Sevenoaks" cluster is designed for high-performance computing, specifically tailored to handle the demands of training and deploying large AI models. The system prioritizes GPU acceleration, high-speed networking, and substantial storage capacity. It's critical to understand the Network Topology before attempting any modifications to the system. This cluster is distinct from the General Purpose Compute Cluster.
Hardware Components
The cluster consists of several key hardware components, detailed below. All hardware is under warranty until 2025, as documented in Hardware Warranty Information.
Compute Nodes
The compute nodes run the cluster's training and inference jobs. Each node is configured as follows.
Specification | Value |
---|---|
Manufacturer | Supermicro |
Model | SYS-220M-360 |
CPU | 2 x Intel Xeon Gold 6338 |
CPU Cores per Node | 64 (32 per CPU) |
RAM | 256 GB DDR4 ECC REG |
GPU | 4 x NVIDIA A100 80GB |
Storage (Local) | 1 TB NVMe SSD (OS & Temp) |
Network Interface | 2 x 200 Gbps InfiniBand |
These nodes are interconnected using a non-blocking InfiniBand network, vital for distributed training. See InfiniBand Configuration for details. Regular hardware health checks are performed as per Server Maintenance Schedule.
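The per-node totals implied by the table above can be worked out with a few lines of arithmetic. The following is a minimal sketch only: the `ComputeNode` class and its field names are hypothetical and exist purely for illustration, and the bandwidth figure is the nominal link rate, not measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeNode:
    """Hypothetical container for the per-node figures listed in the table."""

    gpus: int = 4              # 4 x NVIDIA A100
    gpu_memory_gb: int = 80    # 80 GB per GPU
    ram_gb: int = 256          # host RAM
    ib_ports: int = 2          # 2 x InfiniBand ports
    ib_port_gbps: int = 200    # nominal rate per port, in gigabits per second

    @property
    def total_gpu_memory_gb(self) -> int:
        # Aggregate GPU memory across all GPUs in one node.
        return self.gpus * self.gpu_memory_gb

    @property
    def total_ib_gbps(self) -> int:
        # Aggregate nominal InfiniBand bandwidth in gigabits per second.
        return self.ib_ports * self.ib_port_gbps

    @property
    def total_ib_gbytes_per_s(self) -> float:
        # Same figure in gigabytes per second (8 bits per byte).
        return self.total_ib_gbps / 8


node = ComputeNode()
print(f"GPU memory per node: {node.total_gpu_memory_gb} GB")
print(f"InfiniBand per node: {node.total_ib_gbps} Gbps "
      f"({node.total_ib_gbytes_per_s} GB/s nominal)")
```

One thing the numbers make visible: aggregate GPU memory per node (320 GB) exceeds host RAM (256 GB), which may matter for jobs that stage full model state in host memory.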
Storage Node
A dedicated storage node provides centralized storage for datasets and model checkpoints.
Specification | Value |
---|---|
Manufacturer | Dell |
Model | PowerEdge R750xa |
CPU | 2 x Intel Xeon Platinum 8380 |
CPU Cores | 80 (40 per CPU) |
RAM | 512 GB DDR4 ECC REG |
Storage (Total) | 1 PB NVMe SSD (RAID 10) |
Filesystem | Lustre |
Network Interface | 4 x 100 Gbps Ethernet |
The Lustre filesystem provides high throughput and scalability, crucial for handling large datasets. Refer to Lustre Filesystem Documentation for more information. The storage node is backed up nightly, as described in Backup and Recovery Procedures.
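The table does not say whether the 1 PB figure is raw or usable capacity, and with RAID 10 the two differ by the mirroring factor. The sketch below shows both readings. It assumes two-way mirroring and decimal units (1 PB = 1000 TB); the function names are hypothetical.

```python
def raid10_usable_tb(raw_tb: float, mirror_copies: int = 2) -> float:
    """Usable capacity of a RAID 10 array given its raw capacity.

    RAID 10 stripes across mirrored sets, so usable space is the raw
    space divided by the number of copies kept of each block.
    """
    if mirror_copies < 2:
        raise ValueError("RAID 10 needs at least two copies of each block")
    return raw_tb / mirror_copies


def raid10_raw_tb(usable_tb: float, mirror_copies: int = 2) -> float:
    """Raw capacity needed to provide a given usable capacity."""
    if mirror_copies < 2:
        raise ValueError("RAID 10 needs at least two copies of each block")
    return usable_tb * mirror_copies


ONE_PB_IN_TB = 1000  # decimal units, an assumption

# Reading 1: the 1 PB in the table is raw capacity.
print(f"1 PB raw    -> {raid10_usable_tb(ONE_PB_IN_TB)} TB usable")

# Reading 2: the 1 PB in the table is usable capacity.
print(f"1 PB usable -> {raid10_raw_tb(ONE_PB_IN_TB)} TB raw")
```

Which reading applies should be confirmed against the Lustre Filesystem Documentation before planning dataset placement.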
Network Infrastructure
The cluster network comprises an InfiniBand interconnect, a 100 Gbps Ethernet network, and a firewall.
Component | Specification |
---|---|
Interconnect | Mellanox InfiniBand HDR |
Switches | 2 x Mellanox Spectrum-2 |
Switch Capacity | 800 Gbps |
Ethernet Network | 100 Gbps |
Firewall | Fortinet FortiGate 600F |
The InfiniBand network is isolated from the external network for security reasons. See Firewall Configuration for details on network access control.
Software Stack
The software stack is built upon a Linux foundation, optimized for AI/ML workloads.
- Operating System: Ubuntu 22.04 LTS, managed via Configuration Management System.
- Containerization: Docker and Kubernetes are used for application deployment and orchestration. Refer to Kubernetes Deployment Guide.
- Machine Learning Frameworks: TensorFlow, PyTorch, and JAX are pre-installed and optimized for the NVIDIA A100 GPUs. See ML Framework Versions for specific versions.
- Job Scheduler: Slurm is used for managing and scheduling jobs across the cluster. Refer to Slurm Job Submission for instructions.
- Monitoring: Prometheus and Grafana are used for system monitoring and alerting. Detailed dashboards can be found at Monitoring Dashboards.
- Version Control: All code and configuration files are managed using Git and hosted on Internal Git Repository.
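To illustrate how the scheduler and the node hardware fit together, the sketch below generates a Slurm batch script that requests all four GPUs on each node. It is not the cluster's documented submission procedure (see Slurm Job Submission for that). The helper name `build_sbatch_script`, the job name, and the `python train.py` command are hypothetical, and it assumes GPUs are exposed to Slurm as a generic resource named `gpu`.

```python
def build_sbatch_script(
    job_name: str,
    nodes: int,
    gpus_per_node: int = 4,
    time_limit: str = "01:00:00",
    command: str = "python train.py",  # hypothetical training entry point
) -> str:
    """Return the text of a Slurm batch script with one task per GPU."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if not 1 <= gpus_per_node <= 4:
        # Each compute node in this cluster has 4 GPUs.
        raise ValueError("gpus_per_node must be between 1 and 4")

    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        # Assumes the GRES name configured on the cluster is "gpu".
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        # %x expands to the job name and %j to the job ID.
        "#SBATCH --output=%x-%j.out",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


script = build_sbatch_script("demo-train", nodes=2)
print(script)
```

The generated text can be written to a file and passed to `sbatch`. Launching one task per GPU is a common layout for distributed training, but the right layout depends on the framework in use.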
Security Considerations
Security is paramount. All access to the cluster is controlled through Access Control Procedures. Regular security audits are performed, and findings are documented in Security Audit Reports. Please familiarize yourself with the Data Security Policy.
Future Expansion
Plans are underway to expand the cluster with additional compute nodes and storage capacity in Q4 2024. See Future Expansion Plans for the latest updates.