AI in Preston


AI in Preston: Server Configuration Documentation

Welcome to the documentation for the "AI in Preston" server cluster. This document details the hardware and software configuration of the servers supporting our Artificial Intelligence initiatives within the Preston data centre. This guide is aimed at newcomers to the wiki and server administration tasks. Please read carefully before attempting any modifications.

Overview

The "AI in Preston" project utilizes a distributed server architecture to handle the intensive computational demands of machine learning model training and inference. The cluster is designed for scalability and redundancy, employing a combination of high-performance compute nodes and dedicated storage servers. This documentation covers the core components and their configurations. We will cover the network topology, compute nodes, storage infrastructure, and software stack. Be sure to read the Server Access Policy before attempting to connect to any of these servers. Familiarize yourself with the Data Backup Procedures as well.

Network Topology

The server cluster is deployed within a dedicated VLAN at the Preston data centre. The network is segmented to isolate AI traffic from other services. Key network components include:

  • A core switch providing high-bandwidth connectivity between all servers.
  • A dedicated management network for out-of-band server administration.
  • A separate storage network for communication with the network-attached storage (NAS) devices.

Below is a summary of the network configuration. Refer to the Network Diagram for a visual representation.

Component                    IP Address     Subnet Mask      Gateway
Core Switch                  192.168.10.1   255.255.255.0    192.168.10.254
Management Network Gateway   10.0.0.1       255.255.255.0    N/A
Storage Network Gateway      172.16.0.1     255.255.255.0    N/A

Compute Nodes

The compute nodes are responsible for performing the majority of the AI workload. They are equipped with high-end GPUs and large amounts of RAM. Each node runs a lightweight Linux distribution optimized for machine learning. See the Operating System Standard for more details. Currently, we have 8 compute nodes, designated `ai-preston-compute-01` through `ai-preston-compute-08`. Before running any jobs, please consult the Job Scheduling Policy.

Here's a detailed breakdown of the compute node specifications:

Specification       Value
CPU                 Intel Xeon Gold 6338
RAM                 256 GB DDR4 ECC
GPU                 4 x NVIDIA A100 (80 GB)
Storage (Local)     1 TB NVMe SSD
Network Interface   100 Gbps Ethernet
Operating System    Ubuntu 22.04 LTS (Custom Kernel)
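When scripting against the cluster, the node hostnames (`ai-preston-compute-01` through `ai-preston-compute-08`) can be generated rather than hard-coded. A minimal sketch, assuming the two-digit zero-padded suffix shown above:

```python
def node_names(count: int = 8, prefix: str = "ai-preston-compute") -> list[str]:
    """Hostnames for the compute nodes, using the cluster's
    two-digit zero-padded numbering (01..count)."""
    return [f"{prefix}-{i:02d}" for i in range(1, count + 1)]
```

Iterating over `node_names()` keeps health-check or deployment scripts correct if the node count changes, as anticipated under Future Enhancements.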

Storage Infrastructure

The storage infrastructure provides persistent storage for datasets, model checkpoints, and other AI-related data. We utilize a Network Attached Storage (NAS) solution with high availability and redundancy. The NAS is managed by the Storage Administration Team. All data is backed up daily according to the Data Backup Procedures.

The following table details the NAS configuration:

Specification      Value
NAS Model          NetApp FAS8200
Total Capacity     1 PB
RAID Level         RAID 6
File System        XFS
Network Protocol   NFSv4
Access Control     ACLs
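Before writing large datasets or model checkpoints to the NAS, it is worth verifying free space on the mount. A standard-library sketch; the mount path in the usage example is a hypothetical placeholder, not the cluster's actual mount point:

```python
import shutil

def has_free_space(path: str, required_gib: float) -> bool:
    """True if the filesystem containing `path` has at least
    `required_gib` GiB free (e.g. before saving a model checkpoint)."""
    return shutil.disk_usage(path).free >= required_gib * 1024**3
```

Usage: `has_free_space("/mnt/nas", 100)` before a checkpoint save, where `/mnt/nas` stands in for the real NFSv4 mount point.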

Software Stack

The software stack includes the core machine learning frameworks, libraries, and tools used by the AI team. All software is managed via Software Package Management and is regularly updated for security and stability. Python is the primary programming language, used alongside the following libraries:

  • TensorFlow
  • PyTorch
  • scikit-learn
  • pandas
  • numpy
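A quick way to confirm that an environment on a compute node actually provides these libraries is to probe for their import specs. Note that the import names differ from the package names for PyTorch (`torch`) and scikit-learn (`sklearn`). This checker is an illustrative sketch, not part of the managed software stack:

```python
from importlib.util import find_spec

# Import names for the stack listed above (torch and sklearn are the
# import names for PyTorch and scikit-learn respectively).
STACK = ["tensorflow", "torch", "sklearn", "pandas", "numpy"]

def missing_packages(names=STACK) -> list[str]:
    """Return the subset of `names` that cannot be imported here."""
    return [n for n in names if find_spec(n) is None]
```

An empty return value means the environment is ready; otherwise the list names what still needs installing.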

The servers also include a containerization platform (Docker) for managing dependencies and ensuring reproducibility. Please refer to the Docker Usage Guidelines for details.

Security Considerations

Security is paramount. All servers are protected by a firewall and intrusion detection system. Access to the servers is restricted to authorized personnel only. Regular security audits are conducted by the Security Team. Please report any security vulnerabilities immediately. Review the Security Incident Response Plan.

Future Enhancements

Planned future enhancements include:

  • Upgrading the network infrastructure to 200 Gbps Ethernet.
  • Adding more GPU-powered compute nodes.
  • Implementing a distributed file system for improved scalability.

Related Documentation


Intel-Based Server Configurations

Configuration                    Specifications                                   CPU Benchmark
Core i7-6700K/7700 Server        64 GB DDR4, 2 x 512 GB NVMe SSD                  8046
Core i7-8700 Server              64 GB DDR4, 2 x 1 TB NVMe SSD                    13124
Core i9-9900K Server             128 GB DDR4, 2 x 1 TB NVMe SSD                   49969
Core i9-13900 Server (64 GB)     64 GB RAM, 2 x 2 TB NVMe SSD                     N/A
Core i9-13900 Server (128 GB)    128 GB RAM, 2 x 2 TB NVMe SSD                    N/A
Core i5-13500 Server (64 GB)     64 GB RAM, 2 x 500 GB NVMe SSD                   N/A
Core i5-13500 Server (128 GB)    128 GB RAM, 2 x 500 GB NVMe SSD                  N/A
Core i5-13500 Workstation        64 GB DDR5 RAM, 2 x NVMe SSD, NVIDIA RTX 4000    N/A

AMD-Based Server Configurations

Configuration                     Specifications                     CPU Benchmark
Ryzen 5 3600 Server               64 GB RAM, 2 x 480 GB NVMe         17849
Ryzen 7 7700 Server               64 GB DDR5 RAM, 2 x 1 TB NVMe      35224
Ryzen 9 5950X Server              128 GB RAM, 2 x 4 TB NVMe          46045
Ryzen 9 7950X Server              128 GB DDR5 ECC, 2 x 2 TB NVMe     63561
EPYC 7502P Server (128GB/1TB)     128 GB RAM, 1 TB NVMe              48021
EPYC 7502P Server (128GB/2TB)     128 GB RAM, 2 TB NVMe              48021
EPYC 7502P Server (128GB/4TB)     128 GB RAM, 2 x 2 TB NVMe          48021
EPYC 7502P Server (256GB/1TB)     256 GB RAM, 1 TB NVMe              48021
EPYC 7502P Server (256GB/4TB)     256 GB RAM, 2 x 2 TB NVMe          48021
EPYC 9454P Server                 256 GB RAM, 2 x 2 TB NVMe          N/A


Note: All benchmark scores are approximate and may vary based on configuration. Server availability is subject to stock.