AI in Wales


AI in Wales: Server Configuration Documentation

Welcome to the documentation for the “AI in Wales” server infrastructure. This article details the server configuration used to support our artificial intelligence initiatives within the Wales-based research network. This guide is intended for new system administrators and developers joining the project. Please read carefully to understand the system architecture and required configurations.

Overview

The "AI in Wales" project utilizes a distributed server cluster to handle the computational demands of machine learning model training, inference, and data storage. The core infrastructure is built around a combination of high-performance compute nodes and a robust storage system. The system is designed for scalability and redundancy, ensuring high availability and data integrity. This document will detail the hardware, software, and network configuration of these servers. We utilize a hybrid approach, leveraging both on-premise servers and cloud resources through AWS.

Hardware Specifications

The server cluster consists of three main types of nodes: Compute Nodes, Storage Nodes, and a Management Node. Detailed specifications for each are provided below.

Compute Node Specifications

  • CPU: 2 x Intel Xeon Gold 6338
  • RAM: 512 GB DDR4 ECC Registered
  • GPU: 4 x NVIDIA A100 80 GB
  • Storage: 2 x 1.92 TB NVMe SSD (RAID 0)
  • Network: 2 x 100 Gbps InfiniBand, 1 x 10 Gbps Ethernet
  • Operating System: Ubuntu 22.04 LTS

Storage Node Specifications

  • CPU: 2 x Intel Xeon Silver 4310
  • RAM: 256 GB DDR4 ECC Registered
  • Storage Capacity: 1.2 PB raw (distributed across multiple drives)
  • Drive Type: SAS 7.2K RPM
  • RAID Level: RAID 6
  • Network: 2 x 40 Gbps Ethernet
  • Operating System: CentOS Stream 8

Management Node Specifications

  • CPU: 2 x Intel Xeon E-2388G
  • RAM: 64 GB DDR4 ECC Registered
  • Storage: 2 x 1 TB SATA SSD (RAID 1)
  • Network: 1 x 10 Gbps Ethernet
  • Operating System: Debian 11
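
As a quick sanity check after provisioning a Compute Node, a short PyTorch snippet can confirm that all four A100 GPUs are visible. This is an illustrative sketch only, assuming the NVIDIA drivers and a CUDA-enabled PyTorch build are already installed:

    # Sanity check: confirm the Compute Node's four A100 GPUs are visible to PyTorch.
    # Assumes NVIDIA drivers and a CUDA build of PyTorch are installed.
    import torch

    assert torch.cuda.is_available(), "CUDA is not available - check the driver installation"

    count = torch.cuda.device_count()
    print(f"Visible GPUs: {count}")

    for i in range(count):
        props = torch.cuda.get_device_properties(i)
        print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")

If fewer than four devices are reported, check nvidia-smi and the CUDA_VISIBLE_DEVICES environment variable before scheduling workloads on the node.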

Software Configuration

The software stack is crucial for enabling the AI workloads. We utilize a containerized environment managed by Kubernetes for deploying and scaling applications.

  • Operating Systems: As detailed above, Compute Nodes run Ubuntu 22.04 LTS, Storage Nodes run CentOS Stream 8, and the Management Node runs Debian 11, each chosen for compatibility with that node's workload.
  • Containerization: Docker is used to package applications and dependencies into containers.
  • Orchestration: Kubernetes orchestrates the containers, managing deployment, scaling, and networking (a minimal client sketch follows this list).
  • Machine Learning Frameworks: TensorFlow, PyTorch, and scikit-learn are the primary machine learning frameworks supported.
  • Data Storage: The Storage Nodes run Ceph, a distributed storage system, to provide scalable and reliable storage.
  • Monitoring: Prometheus and Grafana are used for system monitoring and alerting.
  • Version Control: All code is managed using Git and hosted on a private GitLab instance.
  • Networking: Cloud resources hosted on AWS run inside a Virtual Private Cloud (VPC) for network isolation and security.
  • Security: Fail2ban is implemented to mitigate brute-force attacks. We also use iptables for firewall management.
  • Automation: Ansible is used for infrastructure as code and automated deployments.
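
The sketch below shows how a routine orchestration task could be scripted against the cluster using the official Kubernetes Python client. It is a minimal, illustrative example: it assumes the client library is installed and a valid kubeconfig is available, and the namespace and deployment names ("ai-training", "inference-api") are placeholders rather than real names from this cluster.

    # Minimal sketch: list deployments and scale one, using the official
    # Kubernetes Python client. Namespace and deployment names are placeholders.
    from kubernetes import client, config

    config.load_kube_config()  # reads the local kubeconfig
    apps = client.AppsV1Api()

    # List deployments in the (hypothetical) "ai-training" namespace.
    for dep in apps.list_namespaced_deployment(namespace="ai-training").items:
        print(dep.metadata.name, dep.spec.replicas)

    # Scale the (hypothetical) "inference-api" deployment to three replicas.
    apps.patch_namespaced_deployment_scale(
        name="inference-api",
        namespace="ai-training",
        body={"spec": {"replicas": 3}},
    )

In practice, changes like this are applied through the Ansible and GitLab workflow described above rather than ad hoc scripts; the snippet is intended only to show the shape of the API.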

Network Topology

The server cluster is connected via a high-speed network infrastructure. The Compute Nodes have direct access to the Storage Nodes via InfiniBand, ensuring low-latency data transfer. The Management Node is connected to all other nodes via Ethernet for monitoring and administration. A dedicated 10 Gbps link connects the cluster to the external network for data ingestion and model deployment. Detailed network diagrams are available on the internal wiki.
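
Because the Management Node can reach every other node over Ethernet, routine health checks can be scripted against the Prometheus HTTP API from the monitoring stack listed under Software Configuration. The sketch below is illustrative only: the Prometheus address is a placeholder, and it assumes node_exporter metrics are being scraped from each node.

    # Minimal sketch: query the 1-minute load average of every node via the
    # Prometheus HTTP API. The server address below is a placeholder.
    import requests

    PROMETHEUS_URL = "http://management-node:9090"

    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "node_load1"},
        timeout=10,
    )
    resp.raise_for_status()

    for result in resp.json()["data"]["result"]:
        instance = result["metric"].get("instance", "unknown")
        _, value = result["value"]
        print(f"{instance}: 1-minute load average {value}")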

Security Considerations

Security is a paramount concern. All servers are behind a firewall and access is restricted to authorized personnel only. Regular security audits are conducted to identify and address vulnerabilities. Data is encrypted both in transit and at rest. We adhere to the principles of least privilege and regularly update all software to patch security holes.

Future Expansion

We plan to expand the cluster with additional Compute Nodes and Storage Nodes to meet the growing demands of our AI research. We are also exploring the use of GPU virtualization to improve resource utilization. Further integration with cloud services is also planned.


Intel-Based Server Configurations

Configuration | Specifications | CPU Benchmark
Core i7-6700K/7700 Server | 64 GB DDR4, 2 x 512 GB NVMe SSD | 8046
Core i7-8700 Server | 64 GB DDR4, 2 x 1 TB NVMe SSD | 13124
Core i9-9900K Server | 128 GB DDR4, 2 x 1 TB NVMe SSD | 49969
Core i9-13900 Server (64GB) | 64 GB RAM, 2 x 2 TB NVMe SSD | n/a
Core i9-13900 Server (128GB) | 128 GB RAM, 2 x 2 TB NVMe SSD | n/a
Core i5-13500 Server (64GB) | 64 GB RAM, 2 x 500 GB NVMe SSD | n/a
Core i5-13500 Server (128GB) | 128 GB RAM, 2 x 500 GB NVMe SSD | n/a
Core i5-13500 Workstation | 64 GB DDR5 RAM, 2 x NVMe SSD, NVIDIA RTX 4000 | n/a

AMD-Based Server Configurations

Configuration | Specifications | CPU Benchmark
Ryzen 5 3600 Server | 64 GB RAM, 2 x 480 GB NVMe | 17849
Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2 x 1 TB NVMe | 35224
Ryzen 9 5950X Server | 128 GB RAM, 2 x 4 TB NVMe | 46045
Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2 x 2 TB NVMe | 63561
EPYC 7502P Server (128GB/1TB) | 128 GB RAM, 1 TB NVMe | 48021
EPYC 7502P Server (128GB/2TB) | 128 GB RAM, 2 TB NVMe | 48021
EPYC 7502P Server (128GB/4TB) | 128 GB RAM, 2 x 2 TB NVMe | 48021
EPYC 7502P Server (256GB/1TB) | 256 GB RAM, 1 TB NVMe | 48021
EPYC 7502P Server (256GB/4TB) | 256 GB RAM, 2 x 2 TB NVMe | 48021
EPYC 9454P Server | 256 GB RAM, 2 x 2 TB NVMe | n/a

Order Your Dedicated Server

Configure and order the dedicated server that best fits your workload.

⚠️ Note: All benchmark scores are approximate and may vary based on configuration. Server availability is subject to stock.