
AI in the Alps: Server Configuration Documentation

Welcome to the documentation for the "AI in the Alps" server cluster. This article details the hardware and software configuration of our high-performance computing environment dedicated to Artificial Intelligence research, located in a secure, climate-controlled facility in the Swiss Alps. This guide is intended for new users and system administrators who will be interacting with the cluster. Please read this document carefully before attempting any configuration changes; understanding the system architecture is crucial for effective use and maintenance.

Overview

The "AI in the Alps" cluster is designed for demanding AI workloads, with a focus on deep learning, natural language processing, and computer vision. It uses a distributed architecture, leveraging multiple interconnected servers to provide significant processing power and storage capacity. We take a hybrid cloud approach: primary processing runs on-premise, with burst capacity available through a secure connection to a private cloud provider. This setup balances performance, cost, and data security. More information about our security protocols can be found on the Security Policies page. The cluster is managed using Ansible for automated configuration and deployment.

Hardware Configuration

The core of the cluster consists of ten identical compute nodes. Each node is built around a powerful processor and a dedicated graphics processing unit (GPU). A separate storage array provides shared access to a large dataset. Details of the hardware are outlined below:

| Component | Specification | Quantity per Node |
|---|---|---|
| CPU | AMD EPYC 7763 (64-core) | 1 |
| GPU | NVIDIA A100 (80 GB) | 1 |
| RAM | 512 GB DDR4 ECC Registered | 1 |
| Local Storage (OS) | 1 TB NVMe SSD | 1 |
| Network Interface | 2 x 100 GbE Mellanox ConnectX-6 | 1 |
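As a back-of-the-envelope sanity check (not an official capacity statement), the per-node figures above imply the following aggregate capacity across the ten compute nodes:

```python
# Aggregate compute-tier capacity, derived from the per-node
# specifications in the table above (10 identical nodes).
NODE_COUNT = 10

node_spec = {
    "cpu_cores": 64,      # AMD EPYC 7763
    "gpus": 1,            # NVIDIA A100
    "gpu_memory_gb": 80,
    "ram_gb": 512,
    "local_nvme_tb": 1,
}

# Scale every per-node figure by the node count.
cluster_totals = {key: value * NODE_COUNT for key, value in node_spec.items()}

print(f"Total CPU cores:  {cluster_totals['cpu_cores']}")      # 640
print(f"Total GPU memory: {cluster_totals['gpu_memory_gb']} GB")  # 800 GB
print(f"Total RAM:        {cluster_totals['ram_gb']} GB")      # 5120 GB
```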

The cluster also includes a dedicated management node for monitoring and control. This node runs Prometheus for system metrics and Grafana for visualization. The network infrastructure utilizes a non-blocking InfiniBand topology for low-latency communication between nodes.

Storage Configuration

Data storage is handled by a dedicated high-performance storage array. This array is crucial for providing fast access to the large datasets required for AI training. The storage array is a clustered system, providing redundancy and scalability.

| Attribute | Value |
|---|---|
| Storage Type | NVMe SSD |
| Total Capacity | 2 Petabytes |
| File System | Lustre |
| RAID Level | RAID 6 |
| Network Connection | 4 x 200 Gb/s InfiniBand (HDR) |
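RAID 6 sacrifices the equivalent of two drives per group to parity, so usable capacity is below the raw total. A minimal sketch of that overhead, using a hypothetical drive count and size (the actual drive layout of the array is not documented here):

```python
def raid6_usable_tb(drive_count: int, drive_tb: float) -> float:
    """Usable capacity of one RAID 6 group: two drives' worth of
    space is consumed by the double parity."""
    if drive_count < 4:
        raise ValueError("RAID 6 requires at least 4 drives")
    return (drive_count - 2) * drive_tb

# Hypothetical 24-drive group of 15.36 TB NVMe SSDs -- placeholder
# values, not the real layout of the array.
usable = raid6_usable_tb(24, 15.36)
print(f"Usable capacity per group: {usable:.2f} TB")
```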

Access to the storage array is managed through a dedicated NFS server. Data backups are performed nightly to a remote location using rsync. Detailed information about data access and backup procedures can be found on the Data Management page. Regular Disk Health Checks are performed to ensure data integrity.
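To illustrate the nightly rsync backup mentioned above, here is a sketch that assembles a typical invocation. The source path, remote host, and flag selection are placeholders, not the cluster's actual backup job:

```python
def build_backup_command(source: str, destination: str) -> list[str]:
    """Assemble an rsync command of the kind used for nightly
    mirroring to a remote site. Flags are illustrative."""
    return [
        "rsync",
        "--archive",   # preserve permissions, timestamps, symlinks
        "--delete",    # mirror deletions to the backup copy
        "--compress",  # compress data over the wire
        source,
        destination,
    ]

# Hypothetical paths and host, for illustration only.
cmd = build_backup_command("/lustre/datasets/", "backup@remote-site:/backups/datasets/")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # actual execution omitted here
```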

Software Configuration

Each compute node runs a customized distribution of Ubuntu Server 22.04 LTS. The core software stack includes the following components:

| Software | Version | Purpose |
|---|---|---|
| Operating System | Ubuntu Server 22.04 LTS | Base operating system |
| CUDA Toolkit | 12.2 | GPU programming toolkit |
| cuDNN | 8.9 | Deep neural network library |
| PyTorch | 2.0.1 | Deep learning framework |
| TensorFlow | 2.13.0 | Deep learning framework |
| NCCL | 2.14 | Multi-GPU communication library |
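Because the stack is pinned to specific versions, a simple startup check can flag drift between a deployed container and the table above. A minimal sketch; the `installed` dictionary stands in for real probes (e.g. `torch.__version__`, `nvcc --version`), which are assumptions here:

```python
# Pinned versions, taken from the software table above.
PINNED_VERSIONS = {
    "cuda": "12.2",
    "cudnn": "8.9",
    "pytorch": "2.0.1",
    "tensorflow": "2.13.0",
    "nccl": "2.14",
}

def check_stack(installed: dict[str, str]) -> list[str]:
    """Return the names of components whose installed version does
    not match the pinned version (missing components count too)."""
    return [
        name
        for name, pinned in PINNED_VERSIONS.items()
        if installed.get(name) != pinned
    ]

# Example: a container that drifted to a newer PyTorch.
print(check_stack({**PINNED_VERSIONS, "pytorch": "2.1.0"}))  # ['pytorch']
```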

All software is managed using Docker containers to ensure reproducibility and isolation. We utilize a custom CI/CD pipeline based on Jenkins for automated software deployment and updates. The cluster leverages Kubernetes for container orchestration. We have documented our containerization best practices on the Docker Usage Guide page. The system utilizes a centralized logging system based on ELK Stack (Elasticsearch, Logstash, Kibana) for troubleshooting and monitoring.
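For a concrete picture of the containerized workflow, the sketch below assembles a typical GPU-enabled `docker run` invocation. The image name and mount paths are hypothetical; `--gpus all` assumes the NVIDIA Container Toolkit is installed on the node:

```python
def gpu_container_command(image: str, dataset_mount: str) -> list[str]:
    """Build a docker run command for a GPU training container."""
    return [
        "docker", "run", "--rm",
        "--gpus", "all",                    # expose the node's A100 to the container
        "--shm-size", "16g",                # larger shared memory for data-loader workers
        "-v", f"{dataset_mount}:/data:ro",  # read-only dataset mount from the storage array
        image,
        "python", "train.py",
    ]

# Placeholder registry path and mount point, for illustration only.
cmd = gpu_container_command("registry.example/alps/pytorch:2.0.1", "/lustre/datasets")
print(" ".join(cmd))
```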

Networking

The network is logically divided into three segments: Management, Compute, and Storage. The Management network is used for administrative access to the nodes, while the Compute network is used for inter-node communication during AI training. The Storage network provides high-bandwidth access to the storage array. Detailed network diagrams and IP addressing schemes are available on the Network Topology page. Firewall Rules are strictly enforced to ensure network security.
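The three-segment split can be expressed as a simple lookup, for example when validating host configuration. The subnet assignments below are hypothetical placeholders; the real addressing scheme is on the Network Topology page:

```python
import ipaddress

# Hypothetical subnets for the three segments -- illustrative only.
SEGMENTS = {
    "management": ipaddress.ip_network("10.0.0.0/24"),
    "compute": ipaddress.ip_network("10.0.10.0/24"),
    "storage": ipaddress.ip_network("10.0.20.0/24"),
}

def classify(address: str) -> str:
    """Return the network segment an address belongs to."""
    ip = ipaddress.ip_address(address)
    for name, network in SEGMENTS.items():
        if ip in network:
            return name
    return "unknown"

print(classify("10.0.10.42"))  # compute
```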

Future Improvements

We are continually working to improve the performance and capabilities of the "AI in the Alps" cluster. Planned upgrades will be documented on this page as they are confirmed.