AI in Wales: Server Configuration Documentation
Welcome to the documentation for the “AI in Wales” server infrastructure. This article details the server configuration used to support our artificial intelligence initiatives within the Wales-based research network. This guide is intended for new system administrators and developers joining the project. Please read carefully to understand the system architecture and required configurations.
Overview
The "AI in Wales" project utilizes a distributed server cluster to handle the computational demands of machine learning model training, inference, and data storage. The core infrastructure is built around a combination of high-performance compute nodes and a robust storage system. The system is designed for scalability and redundancy, ensuring high availability and data integrity. This document will detail the hardware, software, and network configuration of these servers. We utilize a hybrid approach, leveraging both on-premise servers and cloud resources through AWS.
Hardware Specifications
The server cluster consists of three main types of nodes: Compute Nodes, Storage Nodes, and a Management Node. Detailed specifications for each are provided below.
Compute Node Specifications | Value
---|---
CPU | 2 x Intel Xeon Gold 6338
RAM | 512 GB DDR4 ECC Registered
GPU | 4 x NVIDIA A100 80GB
Local Storage | 2 x 1.92 TB NVMe SSD (RAID 0)
Network | 2 x 100 Gbps InfiniBand, 1 x 10 Gbps Ethernet
Operating System | Ubuntu 22.04 LTS
Storage Node Specifications | Value
---|---
CPU | 2 x Intel Xeon Silver 4310
RAM | 256 GB DDR4 ECC Registered
Raw Capacity | 1.2 PB (distributed across multiple drives)
Drive Type | SAS HDD, 7.2K RPM
RAID Level | RAID 6
Network | 2 x 40 Gbps Ethernet
Operating System | CentOS Stream 8
Management Node Specifications | Value
---|---
CPU | 2 x Intel Xeon E-2388G
RAM | 64 GB DDR4 ECC Registered
Storage | 2 x 1 TB SATA SSD (RAID 1)
Network | 1 x 10 Gbps Ethernet
Operating System | Debian 11
Software Configuration
The software stack is crucial for enabling the AI workloads. We utilize a containerized environment managed by Kubernetes for deploying and scaling applications.
- Operating Systems: As detailed above, Ubuntu 22.04 runs on the Compute Nodes, CentOS Stream 8 on the Storage Nodes, and Debian 11 on the Management Node, matching each operating system to its node's role.
- Containerization: Docker is used to package applications and dependencies into containers.
- Orchestration: Kubernetes orchestrates the containers, managing deployment, scaling, and networking; a job-submission sketch follows this list.
- Machine Learning Frameworks: TensorFlow, PyTorch, and scikit-learn are the primary machine learning frameworks supported.
- Data Storage: The storage nodes utilize a distributed file system, specifically Ceph, to provide scalable and reliable storage.
- Monitoring: Prometheus and Grafana are used for system monitoring and alerting; a query sketch also follows this list.
- Version Control: All code is managed using Git and hosted on a private GitLab instance.
- Networking: We employ a Virtual Private Cloud (VPC) for network isolation and security.
- Security: Fail2ban is implemented to mitigate brute-force attacks. We also use iptables for firewall management.
- Automation: Ansible is used for infrastructure as code and automated deployments.
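As a concrete illustration of the Docker/Kubernetes workflow above, the following sketch submits a one-off GPU training job with the official kubernetes Python client. The container image, the ml-jobs namespace, and the job name are hypothetical; the nvidia.com/gpu resource name is the standard NVIDIA device-plugin convention, which this document does not itself confirm.

```python
# Minimal sketch: submitting a one-off GPU training job to the cluster's
# Kubernetes control plane. Image, namespace, and GPU count are assumptions.
from kubernetes import client, config

def submit_training_job(name: str = "demo-train", gpus: int = 1) -> None:
    config.load_kube_config()  # reads the local kubeconfig (e.g. on the Management Node)
    container = client.V1Container(
        name="trainer",
        image="registry.example.internal/ai-wales/train:latest",  # placeholder image
        command=["python", "train.py"],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": str(gpus)}  # schedules onto an A100 compute node
        ),
    )
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(restart_policy="Never", containers=[container])
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(template=template, backoff_limit=2),
    )
    client.BatchV1Api().create_namespaced_job(namespace="ml-jobs", body=job)

if __name__ == "__main__":
    submit_training_job()
```

Requesting GPUs through resource limits lets the scheduler place training pods only on the Compute Nodes, keeping the Storage and Management Nodes free of GPU workloads.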
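For the monitoring stack, here is a minimal sketch of an instant query against Prometheus's HTTP API using requests. The Prometheus address and the DCGM_FI_DEV_GPU_UTIL metric (exported by NVIDIA's DCGM exporter, whose deployment this document does not confirm) are assumptions.

```python
# Minimal sketch: querying the cluster's Prometheus server for current GPU
# utilisation. The server URL and metric name are assumptions, not values
# taken from this document.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder address

def instant_query(promql: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    payload = resp.json()
    if payload["status"] != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

if __name__ == "__main__":
    # DCGM_FI_DEV_GPU_UTIL is the DCGM exporter's GPU utilisation metric (assumed deployed).
    for sample in instant_query("avg(DCGM_FI_DEV_GPU_UTIL)"):
        print(sample["metric"], sample["value"])
```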
Network Topology
The server cluster is connected via a high-speed network infrastructure. The Compute Nodes are interconnected over 100 Gbps InfiniBand for low-latency traffic, and reach the Storage Nodes over the 40 Gbps Ethernet fabric listed in the hardware tables above. The Management Node is connected to all other nodes via Ethernet for monitoring and administration. A dedicated 10 Gbps link connects the cluster to the external network for data ingestion and model deployment. Detailed network diagrams are available on the internal wiki.
Security Considerations
Security is a paramount concern. All servers are behind a firewall and access is restricted to authorized personnel only. Regular security audits are conducted to identify and address vulnerabilities. Data is encrypted both in transit and at rest. We adhere to the principles of least privilege and regularly update all software to patch security holes.
Future Expansion
We plan to expand the cluster with additional Compute Nodes and Storage Nodes to meet the growing demands of our AI research. We are also exploring the use of GPU virtualization to improve resource utilization. Further integration with cloud services is also planned.