AI in Swindon

From Server rental store
Revision as of 08:32, 16 April 2025 by Admin (talk | contribs) (Automated server configuration article)
AI in Swindon: Server Configuration

This article describes the server configuration behind the "AI in Swindon" project. The project runs machine learning models on a cluster of servers in our Swindon data center, focused primarily on local traffic analysis and predictive maintenance of railway infrastructure. It is intended as an overview of the infrastructure for system administrators and developers joining the project.

Overview

The "AI in Swindon" project uses a distributed architecture: a mix of bare-metal servers and virtual machines (VMs) orchestrated with Kubernetes. GPU-equipped servers handle the core processing, while supporting services such as data storage, monitoring, and web interfaces run on dedicated VMs. A high-bandwidth, low-latency Ethernet fabric connects the servers and keeps data transfer from becoming a bottleneck. The deployment is hybrid: sensitive data stays on on-premise resources, and cloud services provide extra capacity during peak demand. Security is layered, combining firewalls, intrusion detection systems, and regular security audits. A data pipeline carries sensor readings to the AI models in a consistent flow.
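
The hybrid placement rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Job` type, the `choose_target` function, and the 0.85 burst threshold are hypothetical and are not taken from the project's actual scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical description of a workload to be placed."""
    name: str
    uses_sensitive_data: bool


def choose_target(job: Job, onprem_utilization: float,
                  burst_threshold: float = 0.85) -> str:
    """Return "on-prem" or "cloud" for a job.

    burst_threshold is an assumed value, not a project setting.
    """
    if not 0.0 <= onprem_utilization <= 1.0:
        raise ValueError("onprem_utilization must be between 0 and 1")
    # Sensitive data always stays on the on-premise resources.
    if job.uses_sensitive_data:
        return "on-prem"
    # Non-sensitive work bursts to the cloud only during peak demand.
    if onprem_utilization >= burst_threshold:
        return "cloud"
    return "on-prem"


if __name__ == "__main__":
    print(choose_target(Job("traffic-inference", True), 0.95))
    print(choose_target(Job("public-report", False), 0.95))
```

The sensitivity check comes first so that load can never push sensitive data off-premise.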

Hardware Specifications

The project utilizes three primary server types: GPU Servers, Storage Servers, and Control Plane Servers.

GPU Servers

These servers are responsible for the computationally intensive tasks of training and inference of machine learning models.

| Specification | Value |
|---|---|
| CPU | Dual Intel Xeon Gold 6338 (32 cores/64 threads per CPU) |
| RAM | 512 GB DDR4 ECC REG 3200MHz |
| GPU | 4x NVIDIA A100 80GB |
| Storage | 1x 1.92TB NVMe SSD (OS & Applications); 4x 18TB SAS HDD (Data Storage) |
| Network | Dual 100GbE NICs (Mellanox ConnectX-6) |
| Power Supply | 2x 2000W Redundant PSU |

These servers run Red Hat Enterprise Linux with the NVIDIA driver stack for GPU access. Workloads run in Docker containers, which isolates them from one another and keeps their environments reproducible.
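
As a rough illustration of how a containerized workload might request GPUs on these servers, the sketch below builds a Kubernetes Pod manifest as a Python dictionary. It assumes the NVIDIA device plugin is installed in the cluster (which is what exposes the `nvidia.com/gpu` resource name); the function name, pod name, and image are hypothetical.

```python
import json

GPUS_PER_SERVER = 4  # from the GPU server table: 4x NVIDIA A100 80GB


def gpu_pod_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a minimal Pod manifest that requests `gpus` GPUs.

    A pod runs on a single node, so it cannot request more GPUs
    than one server provides.
    """
    if not 1 <= gpus <= GPUS_PER_SERVER:
        raise ValueError(f"gpus must be between 1 and {GPUS_PER_SERVER}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Resource name provided by the NVIDIA device plugin
                    # (assumed to be deployed in this cluster).
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical image name, for illustration only.
    manifest = gpu_pod_manifest(
        "traffic-train", "registry.example/traffic-train:latest", 2
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be passed to `kubectl apply -f -`, since Kubernetes accepts manifests in JSON as well as YAML.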

Storage Servers

These servers provide persistent storage for the project's datasets and model artifacts.

| Specification | Value |
|---|---|
| CPU | Intel Xeon Silver 4310 (12 cores/24 threads) |
| RAM | 256 GB DDR4 ECC REG 3200MHz |
| Storage | 12x 18TB SAS HDD (RAID 6) |
| Network | Dual 25GbE NICs |
| Filesystem | ZFS |
| Power Supply | 2x 1200W Redundant PSU |
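
For capacity planning, the raw and usable space of one storage server can be estimated from the table. The sketch below assumes the usual RAID 6 overhead of two disks' worth of parity, uses decimal terabytes, and ignores filesystem overhead and spares; the function name is hypothetical.

```python
def raid6_usable_tb(disks: int, disk_tb: float) -> float:
    """Usable capacity of a RAID 6 array: two disks' worth goes to parity."""
    if disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return (disks - 2) * disk_tb


if __name__ == "__main__":
    disks, disk_tb = 12, 18  # from the storage server table
    print("raw TB:", disks * disk_tb)
    print("usable TB:", raid6_usable_tb(disks, disk_tb))
```

With 12 disks of 18 TB each, this gives 216 TB raw and 180 TB usable per server before filesystem overhead.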

Data is backed up regularly to an offsite location using rsync. ZFS checksums all data and metadata, so silent corruption is detected on read and can be repaired when a redundant copy exists. Access control is managed through LDAP integration.
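
A backup job of this kind might assemble its rsync invocation along the following lines. This is a sketch, not the project's actual backup script: the source path, destination host, and the choice of flags are assumptions, and the function only builds the command so it can be reviewed before anything runs.

```python
import shlex


def rsync_backup_command(source: str, destination: str,
                         dry_run: bool = True) -> list[str]:
    """Build an rsync command line as an argument list (nothing is executed)."""
    cmd = ["rsync", "--archive", "--delete", "--compress"]
    if dry_run:
        # Report what would be transferred without changing the destination.
        cmd.append("--dry-run")
    # A trailing slash on the source copies the directory's contents
    # rather than the directory itself.
    cmd += [source.rstrip("/") + "/", destination]
    return cmd


if __name__ == "__main__":
    # Hypothetical dataset path and offsite host.
    cmd = rsync_backup_command(
        "/tank/datasets", "backup@offsite.example:/backups/datasets"
    )
    print(shlex.join(cmd))
```

Defaulting to a dry run is a deliberate choice here, because `--delete` removes files at the destination that no longer exist at the source. The resulting list can be handed to `subprocess.run` once the output looks right.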

Control Plane Servers

These servers host the Kubernetes control plane and other essential system services.

| Specification | Value |
|---|---|
| CPU | Intel Xeon Gold 6248R (24 cores/48 threads) |
| RAM | 128 GB DDR4 ECC REG 3200MHz |
| Storage | 2x 960GB NVMe SSD (RAID 1) |
| Network | Dual 10GbE NICs |
| Operating System | Ubuntu Server 22.04 LTS |
| Power Supply | 2x 850W Redundant PSU |

These servers are monitored closely using Prometheus and Grafana. We use Ansible for automated configuration management.
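
As an example of what this monitoring makes possible, the sketch below asks Prometheus which scrape targets are currently down. It uses Prometheus's HTTP API (`/api/v1/query`) and the built-in `up` metric; the base URL and the function names are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_instances(response: dict) -> list[str]:
    """Return instances whose `up` sample is 0 in a query response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    down = []
    for sample in response["data"]["result"]:
        _timestamp, value = sample["value"]  # value arrives as a string
        if float(value) == 0.0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return sorted(down)


def fetch_down_instances(base_url: str) -> list[str]:
    """Query a live Prometheus server (hypothetical URL supplied by caller)."""
    with urlopen(query_url(base_url, "up"), timeout=10) as resp:
        return down_instances(json.load(resp))
```

Keeping the parsing in `down_instances` separate from the network call means it can be exercised with a canned response, without a running Prometheus server.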

Software Stack

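The software stack is built around a core of open-source technologies. Components named elsewhere in this article include Kubernetes for orchestration, Docker for containers, Prometheus and Grafana for monitoring, Ansible for configuration management, ZFS for storage, and rsync for backups.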

Network Configuration

The network is segmented into three zones: a public zone for external access, a private zone for internal communication, and a management zone for administrative access. Firewalls restrict traffic between zones. Inter-server communication within the private zone runs over the high-speed Ethernet fabric described in the Overview. Internal DNS servers handle name resolution for increased reliability.
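
The zone model can be illustrated with a small Python sketch that maps addresses to zones and checks whether a flow between two addresses would be permitted. Everything concrete here is hypothetical: the address ranges and the allowed-flow policy are placeholders, not the project's real addressing plan or firewall rules.

```python
import ipaddress
from typing import Optional

# Hypothetical address ranges for the three zones.
ZONES = {
    "public": ipaddress.ip_network("203.0.113.0/24"),
    "private": ipaddress.ip_network("10.10.0.0/16"),
    "management": ipaddress.ip_network("10.20.0.0/24"),
}

# Hypothetical policy: (source zone, destination zone) pairs that may talk.
ALLOWED_FLOWS = {
    ("public", "private"),
    ("management", "private"),
    ("management", "public"),
}


def zone_of(address: str) -> Optional[str]:
    """Return the zone containing an address, or None if it is in no zone."""
    ip = ipaddress.ip_address(address)
    for name, network in ZONES.items():
        if ip in network:
            return name
    return None


def is_allowed(src: str, dst: str) -> bool:
    """Allow traffic within a zone or along an explicitly permitted flow."""
    src_zone, dst_zone = zone_of(src), zone_of(dst)
    if src_zone is None or dst_zone is None:
        return False  # default deny for unknown addresses
    return src_zone == dst_zone or (src_zone, dst_zone) in ALLOWED_FLOWS
```

The policy is directional on purpose: in this sketch the management zone can reach the private zone, but not the other way around.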

Security Considerations

Security is a top priority. All servers are hardened according to industry best practices. Regular security audits are conducted to identify and address vulnerabilities. Two-factor authentication is required for all administrative access. Data encryption is used both in transit and at rest. We adhere to all relevant data privacy regulations.
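
The article does not say which second factor is used for administrative access. One common choice is a time-based one-time password (TOTP, RFC 6238), and the sketch below shows how such a code is derived and checked using only the Python standard library. Treat it as an illustration of the technique, not as the project's authentication code; the function names are hypothetical.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, at: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238, HMAC-SHA1 variant)."""
    now = time.time() if at is None else at
    counter = int(now // step)  # number of elapsed time steps
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low 4 bits of the last byte
    # select a 4-byte window, whose top bit is then masked off.
    offset = digest[-1] & 0x0F
    truncated = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(truncated % 10 ** digits).zfill(digits)


def verify(secret: bytes, candidate: str, at: Optional[float] = None) -> bool:
    """Compare a submitted code against the expected one in constant time."""
    expected = totp(secret, at)
    return hmac.compare_digest(expected.encode(), candidate.encode())
```

The implementation can be checked against the RFC 6238 test vector: with the secret `b"12345678901234567890"`, a time of 59 seconds, and 8 digits, the code is `94287082`. A production verifier would normally also accept the adjacent time step to tolerate clock drift.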

Future Expansion

We anticipate expanding the cluster to accommodate growing data volumes and increasing computational demands. We are evaluating the use of NVMe over Fabrics to further improve storage performance. We are also exploring the integration of federated learning techniques to enable collaborative model training without sharing sensitive data.
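
To make the federated learning idea concrete, the sketch below shows federated averaging (FedAvg), one widely used aggregation scheme. Each site trains locally and shares only its model weights, which are combined in proportion to the size of each site's local dataset. The choice of FedAvg and the function name are assumptions; the project has not committed to a specific technique.

```python
def federated_average(client_weights: list[list[float]],
                      client_sizes: list[int]) -> list[float]:
    """Average client weight vectors, weighted by local dataset size."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one dataset size per client weight vector")
    length = len(client_weights[0])
    if any(len(w) != length for w in client_weights):
        raise ValueError("all weight vectors must have the same length")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    # Only weights and sample counts are combined; raw data never leaves
    # the client that holds it.
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(length)
    ]


if __name__ == "__main__":
    # Two hypothetical sites with 1 and 3 local samples respectively.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
```

In a real deployment the flat lists would be replaced by the framework's tensor types, and the averaging would run once per training round.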

