AI in Southampton: Server Configuration Documentation
This document describes the server configuration behind the "AI in Southampton" project. It is intended for new team members and for anyone contributing to the infrastructure, and it assumes basic familiarity with Linux server administration and networking concepts.
Overview
The "AI in Southampton" project utilizes a cluster of servers located within the University of Southampton's High-Performance Computing (HPC) facility. These servers are dedicated to training and deploying machine learning models, with a particular focus on natural language processing and computer vision. The current configuration consists of three primary server roles: Data Storage, Processing (Training), and Inference/Serving. This documentation outlines the specifications and software stack of each.
Data Storage Servers
The data storage servers are responsible for holding the large datasets used for training and evaluation. Redundancy is a key consideration, and we utilize a RAID configuration for data protection.
Server Name | Role | Operating System | CPU | RAM | Storage | Network Interface |
---|---|---|---|---|---|---|
data-srv-01 | Primary Data Storage | Ubuntu 22.04 LTS | Intel Xeon Gold 6248R (24 cores) | 256 GB DDR4 ECC | 16 x 16TB SAS HDDs (RAID 6) - ~224 TB usable | 10 GbE |
data-srv-02 | Secondary Data Storage / Backup | Ubuntu 22.04 LTS | Intel Xeon Silver 4210 (10 cores) | 128 GB DDR4 ECC | 8 x 16TB SAS HDDs (RAID 6) - ~96 TB usable | 1 GbE |
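RAID 6 reserves the equivalent of two drives for parity, so usable capacity is (n - 2) times the drive size. A minimal sketch of that arithmetic for the two drive layouts above, ignoring file system and Ceph overhead (the function name is illustrative and not part of any tooling on these servers):

```python
def raid6_usable_tb(drive_count: int, drive_size_tb: float) -> float:
    """Usable capacity of a RAID 6 array, before file system overhead."""
    if drive_count < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    # Two drives' worth of space holds the dual parity.
    return (drive_count - 2) * drive_size_tb


print(raid6_usable_tb(16, 16))  # data-srv-01 layout: 224
print(raid6_usable_tb(8, 16))   # data-srv-02 layout: 96
```

Real-world usable space will be somewhat lower once formatting and replication overhead are taken into account.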
These servers use the Ceph distributed file system for scalability and resilience, and the file system is mounted on the processing servers via NFS. Access is controlled via SSH and SFTP. Backups to tape storage run nightly, and rsync is used for incremental backups.
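One common way to run incremental backups with rsync is to hard-link unchanged files against the previous snapshot using `--link-dest`. The sketch below only builds the command line and does not run it. The paths, the dated snapshot naming, and the use of `--link-dest` are assumptions for illustration and are not taken from the project's actual backup scripts.

```python
import shlex
from datetime import date, timedelta


def build_rsync_command(source: str, backup_root: str, day: date) -> list[str]:
    """Build an rsync command for a dated, hard-linked incremental snapshot."""
    today_dir = f"{backup_root}/{day.isoformat()}"
    previous_dir = f"{backup_root}/{(day - timedelta(days=1)).isoformat()}"
    return [
        "rsync",
        "-a",                           # archive mode: keep permissions, times, symlinks
        "--delete",                     # drop files that no longer exist in the source
        f"--link-dest={previous_dir}",  # hard-link files unchanged since the last snapshot
        f"{source}/",                   # trailing slash: copy the directory's contents
        today_dir,
    ]


# Hypothetical paths; print the command instead of running it.
cmd = build_rsync_command("/srv/datasets", "/backup/datasets", date(2024, 1, 2))
print(shlex.join(cmd))
```

With this layout each dated directory looks like a full copy, while unchanged files consume disk space only once.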
Processing (Training) Servers
These servers perform the computationally intensive task of training machine learning models. They are equipped with powerful GPUs and large amounts of RAM.
Server Name | Role | Operating System | CPU | RAM | GPU | Storage | Network Interface |
---|---|---|---|---|---|---|---|
train-srv-01 | Primary Training Server | Ubuntu 22.04 LTS | AMD EPYC 7763 (64 cores) | 512 GB DDR4 ECC | 4 x NVIDIA A100 (80GB) | 2 x 2TB NVMe SSD (RAID 1) | 100 GbE |
train-srv-02 | Secondary Training Server | Ubuntu 22.04 LTS | AMD EPYC 7763 (64 cores) | 512 GB DDR4 ECC | 4 x NVIDIA A100 (80GB) | 2 x 2TB NVMe SSD (RAID 1) | 100 GbE |
train-srv-03 | Distributed Training Node | Ubuntu 22.04 LTS | Intel Xeon Gold 6338 (32 cores) | 256 GB DDR4 ECC | 2 x NVIDIA RTX 3090 (24GB) | 1 x 1TB NVMe SSD | 10 GbE |
The training environment runs in Docker containers managed by Kubernetes, which gives reproducible builds and simplified deployment. Distributed training across multiple nodes uses MPI. The primary deep learning framework is TensorFlow, and PyTorch is also supported. NVIDIA GPU drivers and CUDA libraries are updated regularly to maintain GPU performance. A job scheduler manages resource allocation.
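In data-parallel distributed training, each MPI rank typically processes a disjoint slice of the dataset. The sketch below shows one simple way to compute such a slice. It is an illustration rather than the project's training code, and it takes the rank and world size as plain arguments (with mpi4py they would come from `MPI.COMM_WORLD.Get_rank()` and `MPI.COMM_WORLD.Get_size()`). The example world size of 10 assumes one rank per GPU across the three training servers.

```python
def shard_indices(num_samples: int, rank: int, world_size: int) -> range:
    """Return the sample indices assigned to one rank (strided split)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Rank r takes samples r, r + world_size, r + 2 * world_size, ...
    return range(rank, num_samples, world_size)


world_size = 10  # assumption: one rank per GPU (4 + 4 + 2)
shards = [shard_indices(1003, r, world_size) for r in range(world_size)]
print([len(s) for s in shards])
```

A strided split keeps shard sizes within one sample of each other. It does not account for the speed difference between the A100 and RTX 3090 nodes, which a real setup may need to balance separately.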
Inference/Serving Servers
These servers are responsible for deploying trained models and serving predictions to applications. They prioritize low latency and high availability.
Server Name | Role | Operating System | CPU | RAM | GPU | Storage | Network Interface |
---|---|---|---|---|---|---|---|
infer-srv-01 | Primary Inference Server | Ubuntu 22.04 LTS | Intel Xeon Gold 6230 (20 cores) | 128 GB DDR4 ECC | 2 x NVIDIA T4 (16GB) | 1 x 1TB NVMe SSD | 10 GbE |
infer-srv-02 | Secondary Inference Server / Load Balancer | Ubuntu 22.04 LTS | Intel Xeon Gold 6230 (20 cores) | 128 GB DDR4 ECC | 2 x NVIDIA T4 (16GB) | 1 x 1TB NVMe SSD | 10 GbE |
Model deployment is handled using TensorFlow Serving. A reverse proxy (Nginx) distributes traffic across the inference servers, which are secured with TLS. Models are accessed through REST APIs, with gRPC available for high-performance communication. Prometheus monitors server performance, and Grafana provides visualization.
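TensorFlow Serving's REST API accepts prediction requests as a JSON body containing an `instances` list, posted to `/v1/models/<name>:predict`. The sketch below builds such a request with the Python standard library without sending it. The base URL, model name, and input values are placeholders; the real endpoint and model names for this deployment are not documented here.

```python
import json
import urllib.request


def build_predict_request(base_url: str, model: str, instances: list) -> urllib.request.Request:
    """Build a TensorFlow Serving REST predict request (not sent here)."""
    url = f"{base_url}/v1/models/{model}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL and model name, for illustration only.
req = build_predict_request("https://inference.example", "demo_model", [[1.0, 2.0, 3.0]])
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req), then json.load() the response
# and read its "predictions" field.
```

Clients that need lower overhead than JSON over HTTP can use the gRPC interface instead.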
Networking & Security
All servers sit behind a firewall, and access is restricted to authorized personnel via VPN. Internal network communication is secured using TLS. Regular security audits are conducted. We adhere to the University of Southampton’s IT security policies.
Future Considerations
Planned upgrades include migrating to newer GPU architectures (e.g., NVIDIA H100) and expanding the storage capacity with faster NVMe drives. We are also exploring the use of serverless computing for certain inference tasks.