AI in the Ural Mountains: Server Configuration

This article details the server configuration for our "AI in the Ural Mountains" project, a distributed computing initiative focused on processing geological data using machine learning algorithms. This guide is aimed at new team members responsible for server maintenance and deployment. It covers hardware, software, networking, and security aspects of the system.

Overview

The project utilizes a cluster of servers located in a secure facility within the Ural Mountains. The primary goal is to analyze seismic data, mineral composition scans, and historical geological surveys to identify potential resource deposits and predict geological events. The server architecture is designed for high throughput, scalability, and redundancy. The operating system of choice is Ubuntu Server 22.04 LTS, due to its stability, community support, and compatibility with the necessary machine learning frameworks.

Hardware Configuration

The cluster consists of 20 servers: 19 identical worker nodes and one master node. Each worker node is built with the following specifications:

Component         | Specification
------------------|--------------------------------------
CPU               | AMD EPYC 7763 (64 cores, 128 threads)
RAM               | 256 GB DDR4 ECC Registered
Storage (OS)      | 1 TB NVMe SSD
Storage (Data)    | 16 TB SAS HDD (RAID 6)
Network Interface | Dual 100 GbE Ethernet
Power Supply      | Redundant 1600 W Platinum PSUs
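RAID 6 reserves two disks' worth of capacity per array for parity, which determines how raw drive capacity maps to the usable figures above. A minimal sketch of that arithmetic, using hypothetical disk counts (the article does not state how many drives make up each array):

```python
def raid6_usable_tb(disk_count: int, disk_size_tb: float) -> float:
    """Usable capacity of a RAID 6 array: two disks' worth is lost to parity."""
    if disk_count < 4:
        raise ValueError("RAID 6 requires at least 4 disks")
    return (disk_count - 2) * disk_size_tb

# Hypothetical example: six 4 TB drives yield 16 TB usable.
print(raid6_usable_tb(6, 4.0))  # 16.0
```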

The master node has enhanced specifications for coordinating the cluster. These are detailed below:

Component         | Specification
------------------|--------------------------------------
CPU               | AMD EPYC 7763 (64 cores, 128 threads)
RAM               | 512 GB DDR4 ECC Registered
Storage (OS)      | 2 TB NVMe SSD (RAID 1)
Storage (Data)    | 32 TB SAS HDD (RAID 6)
Network Interface | Quad 100 GbE Ethernet

A dedicated Network Attached Storage (NAS) device with 1 PB of capacity is used for long-term data archiving. All servers are housed in a temperature- and humidity-controlled data center with redundant power and cooling systems. See Data Center Redundancy for more details.

Software Stack

Each server runs a standardized software stack, ensuring consistency and ease of management.

Software         | Version                 | Purpose
-----------------|-------------------------|--------------------------------------
Operating System | Ubuntu Server 22.04 LTS | Base OS
Python           | 3.10                    | Primary programming language
TensorFlow       | 2.12                    | Machine learning framework
PyTorch          | 2.0                     | Alternative machine learning framework
CUDA Toolkit     | 12.1                    | GPU acceleration
Docker           | 20.10                   | Containerization
Kubernetes       | 1.26                    | Container orchestration
OpenSSH          | 8.9                     | Remote access (SSH server)

We utilize Docker and Kubernetes for containerization and orchestration, allowing for efficient resource utilization and simplified deployment of machine learning models. The master node also runs a Prometheus instance for monitoring and alerting. Detailed instructions for setting up the software stack are available on the Software Installation Guide page.
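Maintenance scripts can consume the master node's Prometheus instance through its standard HTTP API, which returns instant-query results as JSON under `data.result`. A minimal sketch of parsing the built-in `up` metric to list unreachable scrape targets; the endpoint and response shape follow Prometheus's documented API, while the instance names are hypothetical:

```python
import json

def down_instances(response_text: str) -> list[str]:
    """Return instance labels whose `up` metric is 0 (target unreachable)."""
    payload = json.loads(response_text)
    results = payload.get("data", {}).get("result", [])
    return [
        r["metric"].get("instance", "<unknown>")
        for r in results
        if r["value"][1] == "0"  # Prometheus encodes sample values as strings
    ]

# Hypothetical response body for the instant query `up`
# (GET /api/v1/query?query=up on the master node's Prometheus):
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"instance": "worker-07:9100"}, "value": [1700000000, "0"]},
        {"metric": {"instance": "worker-01:9100"}, "value": [1700000000, "1"]},
    ]},
})
print(down_instances(sample))  # ['worker-07:9100']
```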

Networking Configuration

The servers are connected via a dedicated 100 GbE network. The network topology is a Clos network, providing high bandwidth and low latency. A dedicated VLAN is used for inter-server communication, and another for external access. The master node acts as the network gateway. Firewall rules are configured using iptables to restrict access to essential services. The network configuration details are documented in the Network Diagram. We also employ DNS for service discovery within the cluster.
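The split between the inter-server VLAN and the external-access VLAN can be sanity-checked in automation scripts. A minimal sketch using Python's standard-library `ipaddress` module, with hypothetical subnet assignments (the actual addressing plan lives in the Network Diagram, not this article):

```python
import ipaddress

# Hypothetical VLAN subnets -- see the Network Diagram for the real plan.
CLUSTER_VLAN = ipaddress.ip_network("10.10.0.0/24")   # inter-server traffic
EXTERNAL_VLAN = ipaddress.ip_network("10.20.0.0/24")  # external access

def classify(host: str) -> str:
    """Map a host address to the VLAN it belongs to."""
    addr = ipaddress.ip_address(host)
    if addr in CLUSTER_VLAN:
        return "cluster"
    if addr in EXTERNAL_VLAN:
        return "external"
    return "unknown"

print(classify("10.10.0.42"))  # cluster
```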

Security Considerations

Security is paramount. The following measures are in place:
