AI in the Alps: Server Configuration Documentation
Welcome to the documentation for the "AI in the Alps" server cluster! This article details the hardware and software configuration of our high-performance computing environment dedicated to Artificial Intelligence research, located in a secure, climate-controlled facility in the Swiss Alps. This guide is intended for new users and system administrators who will be interacting with the cluster. Please read this document carefully before attempting any configuration changes. Understanding the system architecture is crucial for effective use and maintenance.
Overview
The "AI in the Alps" cluster is designed for demanding AI workloads, with a focus on deep learning, natural language processing, and computer vision. It uses a distributed architecture: multiple interconnected servers provide significant processing power and storage capacity. We take a hybrid cloud approach, with primary processing done on-premise and burst capacity available through a secure connection to a private cloud provider; this balances performance, cost, and data security. More information about our security protocols can be found on the Security Policies page. The cluster is managed using Ansible for automated configuration and deployment.
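As a sketch of what the Ansible side of this management might look like, a minimal inventory could group the ten compute nodes and the management node; the host and group names below are hypothetical examples, not the cluster's actual names:

```ini
; Hypothetical Ansible inventory for the cluster.
; "node[01:10]" expands to node01 through node10 (Ansible host-range syntax).
[compute]
node[01:10].alps.internal

[management]
mgmt01.alps.internal
```

With such an inventory in place, an ad-hoc connectivity check would be as simple as `ansible compute -i inventory.ini -m ping`.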
Hardware Configuration
The core of the cluster consists of ten identical compute nodes. Each node is built around a powerful processor and a dedicated graphics processing unit (GPU). A separate storage array provides shared access to a large dataset. Details of the hardware are outlined below:
| Component | Specification | Quantity per Node |
|---|---|---|
| CPU | AMD EPYC 7763 (64-core) | 1 |
| GPU | NVIDIA A100 (80 GB) | 1 |
| RAM | 512 GB DDR4 ECC Registered | 1 |
| Local Storage (OS) | 1 TB NVMe SSD | 1 |
| Network Interface | 2 x 100 GbE Mellanox ConnectX-6 | 1 |
The cluster also includes a dedicated management node for monitoring and control. This node runs Prometheus for system metrics and Grafana for visualization. The network infrastructure utilizes a non-blocking InfiniBand topology for low-latency communication between nodes.
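Taken together, the ten identical compute nodes give the cluster a sizable aggregate capacity. A quick back-of-the-envelope tally, using only the per-node figures from the hardware table above:

```python
# Aggregate capacity across the ten identical compute nodes,
# computed from the per-node figures in the hardware table.
NODES = 10

per_node = {
    "cpu_cores": 64,     # AMD EPYC 7763, 64 cores
    "gpus": 1,           # NVIDIA A100
    "gpu_mem_gb": 80,    # memory per A100
    "ram_gb": 512,       # DDR4 ECC per node
}

totals = {key: value * NODES for key, value in per_node.items()}

print(f"CPU cores : {totals['cpu_cores']}")      # 640
print(f"GPUs      : {totals['gpus']}")           # 10
print(f"GPU memory: {totals['gpu_mem_gb']} GB")  # 800 GB
print(f"RAM       : {totals['ram_gb']} GB")      # 5120 GB
```

In round numbers: 640 CPU cores, 10 A100s with 800 GB of combined GPU memory, and 5 TB of system RAM, before counting cloud burst capacity.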
Storage Configuration
Data storage is handled by a dedicated high-performance storage array. This array is crucial for providing fast access to the large datasets required for AI training. The storage array is a clustered system, providing redundancy and scalability.
| Attribute | Value |
|---|---|
| Storage Type | NVMe SSD |
| Total Capacity | 2 Petabytes |
| File System | Lustre |
| RAID Level | RAID 6 |
| Network Connection | 4 x 200 Gb/s InfiniBand |
Access to the storage array is managed through a dedicated NFS server. Data backups are performed nightly to a remote location using rsync. Detailed information about data access and backup procedures can be found on the Data Management page. Regular Disk Health Checks are performed to ensure data integrity.
Software Configuration
Each compute node runs a customized distribution of Ubuntu Server 22.04 LTS. The core software stack includes the following components:
| Software | Version | Purpose |
|---|---|---|
| Operating System | Ubuntu Server 22.04 LTS | Base operating system |
| CUDA Toolkit | 12.2 | GPU programming toolkit |
| cuDNN | 8.9 | Deep neural network library |
| PyTorch | 2.0.1 | Deep learning framework |
| TensorFlow | 2.13.0 | Deep learning framework |
| NCCL | 2.14 | Multi-GPU communication library |
All software is managed as Docker containers to ensure reproducibility and isolation, with Kubernetes handling container orchestration. A custom Jenkins-based CI/CD pipeline automates software deployment and updates. Containerization best practices are documented on the Docker Usage Guide page. Centralized logging is provided by an ELK Stack (Elasticsearch, Logstash, Kibana) for troubleshooting and monitoring.
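A container image pinning the versions from the software table might start from an NVIDIA CUDA base image. This is only a sketch: the exact base-image tag and install steps are assumptions, not the cluster's actual Dockerfile, and production wheels would normally be pinned against a specific CUDA build:

```dockerfile
# Hypothetical compute-node image; the tag follows NVIDIA's published
# cuda:<version>-cudnn<major>-runtime-<os> naming scheme.
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04

# Pin the framework versions listed in the software table above.
RUN apt-get update && apt-get install -y python3-pip && \
    pip3 install torch==2.0.1 tensorflow==2.13.0
```

Pinning every version in the image, rather than installing "latest", is what makes experiments reproducible across nodes and over time.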
Networking
The network is logically divided into three segments: Management, Compute, and Storage. The Management network is used for administrative access to the nodes, while the Compute network is used for inter-node communication during AI training. The Storage network provides high-bandwidth access to the storage array. Detailed network diagrams and IP addressing schemes are available on the Network Topology page. Firewall Rules are strictly enforced to ensure network security.
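During multi-node training, NCCL can be steered onto the Compute segment explicitly via its standard environment variables. The interface name below is a hypothetical example, not the cluster's actual device name:

```shell
#!/bin/sh
# Pin NCCL's socket traffic to the compute-network interface
# (the interface name "ens1f0" is an assumed example).
export NCCL_SOCKET_IFNAME=ens1f0

# Log NCCL's chosen transports and rings at startup, which is
# useful for verifying that traffic takes the intended path.
export NCCL_DEBUG=INFO
```

With these set in the job environment, NCCL's startup log shows which interface and transport it selected, making misrouted training traffic easy to spot.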
Future Improvements
We are continually working to improve the performance and capabilities of the "AI in the Alps" cluster. Planned upgrades include:
- Adding more GPU nodes to increase processing power.
- Migrating to a faster interconnect technology.
- Integrating a dedicated machine learning model registry.
- Implementing automated resource allocation based on workload requirements.
Contact Support if you encounter any issues or have any questions.