
AI in the Alps: Server Configuration Documentation

Welcome to the documentation for the "AI in the Alps" server cluster. This article details the hardware and software configuration of our high-performance computing environment dedicated to Artificial Intelligence research, located in a secure, climate-controlled facility in the Swiss Alps. This guide is intended for new users and system administrators who will be interacting with the cluster. Please read this document carefully before attempting any configuration changes; understanding the system architecture is crucial for effective use and maintenance.

Overview

The "AI in the Alps" cluster is designed for demanding AI workloads, with a focus on deep learning, natural language processing, and computer vision. It uses a distributed architecture, leveraging multiple interconnected servers to provide significant processing power and storage capacity. We take a hybrid cloud approach: primary processing runs on-premise, with burst capacity available through a secure connection to a private cloud provider. This setup balances performance, cost, and data security. More information about our security protocols can be found on the Security Policies page. The cluster is managed using Ansible for automated configuration and deployment.

Hardware Configuration

The core of the cluster consists of ten identical compute nodes. Each node is built around a powerful processor and a dedicated graphics processing unit (GPU). A separate storage array provides shared access to a large dataset. Details of the hardware are outlined below:

| Component | Specification | Quantity per Node |
|---|---|---|
| CPU | AMD EPYC 7763 (64-core) | 1 |
| GPU | NVIDIA A100 (80 GB) | 1 |
| RAM | 512 GB DDR4 ECC Registered | 1 |
| Local Storage (OS) | 1 TB NVMe SSD | 1 |
| Network Interface | 2 x 100 GbE Mellanox ConnectX-6 | 1 |
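As a back-of-the-envelope sanity check (not an official capacity statement), the per-node figures above imply the following aggregate capacity across the ten compute nodes:

```python
# Aggregate compute-tier capacity, derived from the per-node
# specifications in the table above (10 identical nodes).
NODE_COUNT = 10

node_spec = {
    "cpu_cores": 64,      # AMD EPYC 7763
    "gpus": 1,            # NVIDIA A100
    "gpu_memory_gb": 80,
    "ram_gb": 512,
    "local_nvme_tb": 1,
}

# Scale every per-node figure by the node count.
cluster_totals = {key: value * NODE_COUNT for key, value in node_spec.items()}

print(f"Total CPU cores:  {cluster_totals['cpu_cores']}")      # 640
print(f"Total GPU memory: {cluster_totals['gpu_memory_gb']} GB")  # 800 GB
print(f"Total RAM:        {cluster_totals['ram_gb']} GB")      # 5120 GB
```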

The cluster also includes a dedicated management node for monitoring and control. This node runs Prometheus for system metrics and Grafana for visualization. The network infrastructure utilizes a non-blocking InfiniBand topology for low-latency communication between nodes.

Storage Configuration

Data storage is handled by a dedicated high-performance storage array. This array is crucial for providing fast access to the large datasets required for AI training. The storage array is a clustered system, providing redundancy and scalability.

| Attribute | Value |
|---|---|
| Storage Type | NVMe SSD |
| Total Capacity | 2 Petabytes |
| File System | Lustre |
| RAID Level | RAID 6 |
| Network Connection | 4 x 200 Gb/s InfiniBand (HDR) |
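RAID 6 sacrifices the equivalent of two drives per group to parity, so usable capacity is below the raw total. A minimal sketch of that overhead, using a hypothetical drive count and size (the actual drive layout of the array is not documented here):

```python
def raid6_usable_tb(drive_count: int, drive_tb: float) -> float:
    """Usable capacity of one RAID 6 group: two drives' worth of
    space is consumed by the double parity."""
    if drive_count < 4:
        raise ValueError("RAID 6 requires at least 4 drives")
    return (drive_count - 2) * drive_tb

# Hypothetical 24-drive group of 15.36 TB NVMe SSDs -- placeholder
# values, not the real layout of the array.
usable = raid6_usable_tb(24, 15.36)
print(f"Usable capacity per group: {usable:.2f} TB")
```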

Access to the storage array is managed through a dedicated NFS server. Data backups are performed nightly to a remote location using rsync. Detailed information about data access and backup procedures can be found on the Data Management page. Regular Disk Health Checks are performed to ensure data integrity.
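To illustrate the nightly rsync backup mentioned above, here is a sketch that assembles a typical invocation. The source path, remote host, and flag selection are placeholders, not the cluster's actual backup job:

```python
def build_backup_command(source: str, destination: str) -> list[str]:
    """Assemble an rsync command of the kind used for nightly
    mirroring to a remote site. Flags are illustrative."""
    return [
        "rsync",
        "--archive",   # preserve permissions, timestamps, symlinks
        "--delete",    # mirror deletions to the backup copy
        "--compress",  # compress data over the wire
        source,
        destination,
    ]

# Hypothetical paths and host, for illustration only.
cmd = build_backup_command("/lustre/datasets/", "backup@remote-site:/backups/datasets/")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # actual execution omitted here
```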

Software Configuration

Each compute node runs a customized distribution of Ubuntu Server 22.04 LTS. The core software stack includes the following components:

| Software | Version | Purpose |
|---|---|---|
| Operating System | Ubuntu Server 22.04 LTS | Base operating system |
| CUDA Toolkit | 12.2 | GPU programming toolkit |
| cuDNN | 8.9 | Deep neural network library |
| PyTorch | 2.0.1 | Deep learning framework |
| TensorFlow | 2.13.0 | Deep learning framework |
| NCCL | 2.14 | Multi-GPU communication library |
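Because the stack is pinned to specific versions, a simple startup check can flag drift between a deployed container and the table above. A minimal sketch; the `installed` dictionary stands in for real probes (e.g. `torch.__version__`, `nvcc --version`), which are assumptions here:

```python
# Pinned versions, taken from the software table above.
PINNED_VERSIONS = {
    "cuda": "12.2",
    "cudnn": "8.9",
    "pytorch": "2.0.1",
    "tensorflow": "2.13.0",
    "nccl": "2.14",
}

def check_stack(installed: dict[str, str]) -> list[str]:
    """Return the names of components whose installed version does
    not match the pinned version (missing components count too)."""
    return [
        name
        for name, pinned in PINNED_VERSIONS.items()
        if installed.get(name) != pinned
    ]

# Example: a container that drifted to a newer PyTorch.
print(check_stack({**PINNED_VERSIONS, "pytorch": "2.1.0"}))  # ['pytorch']
```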

All software is managed using Docker containers to ensure reproducibility and isolation. We utilize a custom CI/CD pipeline based on Jenkins for automated software deployment and updates. The cluster leverages Kubernetes for container orchestration. We have documented our containerization best practices on the Docker Usage Guide page. The system utilizes a centralized logging system based on ELK Stack (Elasticsearch, Logstash, Kibana) for troubleshooting and monitoring.
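For a concrete picture of the containerized workflow, the sketch below assembles a typical GPU-enabled `docker run` invocation. The image name and mount paths are hypothetical; `--gpus all` assumes the NVIDIA Container Toolkit is installed on the node:

```python
def gpu_container_command(image: str, dataset_mount: str) -> list[str]:
    """Build a docker run command for a GPU training container."""
    return [
        "docker", "run", "--rm",
        "--gpus", "all",                    # expose the node's A100 to the container
        "--shm-size", "16g",                # larger shared memory for data-loader workers
        "-v", f"{dataset_mount}:/data:ro",  # read-only dataset mount from the storage array
        image,
        "python", "train.py",
    ]

# Placeholder registry path and mount point, for illustration only.
cmd = gpu_container_command("registry.example/alps/pytorch:2.0.1", "/lustre/datasets")
print(" ".join(cmd))
```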

Networking

The network is logically divided into three segments: Management, Compute, and Storage. The Management network is used for administrative access to the nodes, while the Compute network is used for inter-node communication during AI training. The Storage network provides high-bandwidth access to the storage array. Detailed network diagrams and IP addressing schemes are available on the Network Topology page. Firewall Rules are strictly enforced to ensure network security.
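The three-segment split can be expressed as a simple lookup, for example when validating host configuration. The subnet assignments below are hypothetical placeholders; the real addressing scheme is on the Network Topology page:

```python
import ipaddress

# Hypothetical subnets for the three segments -- illustrative only.
SEGMENTS = {
    "management": ipaddress.ip_network("10.0.0.0/24"),
    "compute": ipaddress.ip_network("10.0.10.0/24"),
    "storage": ipaddress.ip_network("10.0.20.0/24"),
}

def classify(address: str) -> str:
    """Return the network segment an address belongs to."""
    ip = ipaddress.ip_address(address)
    for name, network in SEGMENTS.items():
        if ip in network:
            return name
    return "unknown"

print(classify("10.0.10.42"))  # compute
```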

Future Improvements

We are continually working to improve the performance and capabilities of the "AI in the Alps" cluster. Planned upgrades will be documented on this page as they are confirmed.