AI in the Andes: Server Configuration Documentation
Welcome to the documentation for "AI in the Andes," our high-performance server cluster dedicated to Artificial Intelligence and Machine Learning workloads. This article provides an overview of the server configuration for newcomers to the system. Understanding this setup is essential for effective job submission, resource allocation, and troubleshooting. The documentation focuses on the core hardware and software components.
Overview
"AI in the Andes" is a distributed computing environment built to accelerate research in areas such as deep learning, natural language processing, and computer vision. The cluster comprises interconnected servers, each optimized for computational intensity, using a hybrid architecture that combines CPU and GPU resources for optimal performance across a wide range of AI tasks. Slurm handles workload management and Anaconda handles environment control. This document details the specifics of each server node within the cluster. Before submitting workloads, please review our Job Submission Guide, familiarize yourself with System Limits, and read our Security Policy.
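A minimal Slurm batch script sketches the typical submission workflow. The job name, resource amounts, environment name, and script path below are hypothetical placeholders, and the exact GRES syntax depends on how the administrators have configured Slurm; the Job Submission Guide has the authoritative options.

```shell
#!/bin/bash
#SBATCH --job-name=train-demo       # hypothetical job name
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:2                # request 2 of a node's 4 A100 GPUs
#SBATCH --mem=128G
#SBATCH --time=04:00:00
#SBATCH --output=%x-%j.out          # log file named after job name and ID

# Activate a project environment (see the Anaconda Tutorial).
source activate my-project-env      # hypothetical environment name

srun python train.py                # hypothetical training script
```

Submit the script with `sbatch` and monitor it with `squeue -u $USER`.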
Hardware Specifications
The cluster consists of 10 identical server nodes, designated 'andes01' through 'andes10'. Each node is built with the following specifications:
Component | Specification |
---|---|
CPU | 2 x Intel Xeon Gold 6338 (32 cores/64 threads per CPU) |
RAM | 512 GB DDR4 ECC Registered RAM @ 3200MHz |
GPU | 4 x NVIDIA A100 80GB PCIe 4.0 |
Storage (Local) | 2 x 4TB NVMe PCIe 4.0 SSD (RAID 0) |
Network Interface | 2 x 200Gbps InfiniBand |
Power Supply | 2 x 2000W Platinum Redundant Power Supplies |
These specifications provide substantial processing power and memory capacity for demanding AI applications. The high-speed networking ensures efficient data transfer between nodes for distributed training. Refer to the Networking Guide for more details on InfiniBand configuration. The local storage is primarily for temporary files and caching; long-term storage resides on our Network File System.
Software Environment
Each server node runs a consistent software stack. This ensures compatibility and simplifies deployment of AI frameworks.
Software | Version |
---|---|
Operating System | CentOS Linux 7.9 |
CUDA Toolkit | 11.8 |
cuDNN | 8.6.0 |
NVIDIA Driver | 525.85.05 |
Python | 3.9 |
Anaconda | 2023.03 |
Slurm | 22.05 |
MPI | OpenMPI 4.1.4 |
All users are encouraged to utilize the Anaconda environment manager to create isolated environments for their projects. This prevents dependency conflicts and ensures reproducibility. See our Anaconda Tutorial for detailed instructions. We also provide pre-configured environments for popular frameworks like TensorFlow, PyTorch, and Keras. The Software Catalog lists all available software.
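As a sketch of that workflow (the environment and package names here are illustrative, not a prescribed setup):

```shell
# Create an isolated environment pinned to the cluster's Python version.
conda create --name my-project python=3.9 --yes

# Activate it and install a framework build matching CUDA 11.8.
conda activate my-project
conda install --yes pytorch pytorch-cuda=11.8 -c pytorch -c nvidia

# Export the environment so the setup can be reproduced later.
conda env export > environment.yml
```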
Storage System
The "AI in the Andes" cluster utilizes a shared network file system (NFS) for persistent storage. This allows users to access their data from any node in the cluster.
Mount Point | Capacity | Purpose |
---|---|---|
/home/[username] | 10 TB | User Home Directory (Personal files, scripts) |
/data/shared | 100 TB | Shared Data Storage (Datasets, project files) |
/tmp | 1 TB per node | Temporary Storage (Automatically purged) |
Access to the `/data/shared` directory requires appropriate permissions, which can be requested through the Support Portal. We strongly recommend storing all persistent data on the NFS file system rather than on the local SSDs. The Data Backup Policy details our data protection procedures.
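A common pattern, sketched below with stand-in data, is to stage inputs into a unique scratch directory on the fast local `/tmp`, compute there, and copy results back to the NFS file system before the job ends. The directory and file names are illustrative, and the `tr` call stands in for the real computation.

```shell
#!/bin/bash
# Stage inputs to local scratch, work there, then move results to NFS
# before /tmp is purged.
set -euo pipefail

SCRATCH=$(mktemp -d /tmp/andes-job.XXXXXX)   # unique per-job scratch dir

# In a real job this would copy a dataset from /data/shared or $HOME.
echo "input data" > "$SCRATCH/input.txt"

# Stand-in for the actual computation.
tr '[:lower:]' '[:upper:]' < "$SCRATCH/input.txt" > "$SCRATCH/result.txt"

RESULT=$(cat "$SCRATCH/result.txt")
echo "$RESULT"

# In a real job, copy results back to $HOME or /data/shared here,
# then clean up scratch space.
rm -rf "$SCRATCH"
```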
Networking Configuration
The cluster's high-performance networking is based on InfiniBand. This provides low-latency, high-bandwidth communication between nodes, crucial for distributed training. Each node has two 200Gbps InfiniBand ports connected to a Fat Tree topology. The network is managed by a dedicated InfiniBand switch. Users should be aware of the InfiniBand Best Practices to optimize network performance. Firewall rules are managed by the System Administrators and are not directly configurable by users.
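Multi-node MPI jobs are normally launched through Slurm so that OpenMPI inherits the allocation and routes inter-node traffic over the InfiniBand fabric automatically. A hedged sketch, in which the node counts, job name, and the program being launched are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=mpi-demo         # hypothetical job name
#SBATCH --nodes=4                   # span 4 of the 10 andes nodes
#SBATCH --ntasks-per-node=4         # one MPI rank per GPU
#SBATCH --gres=gpu:4

# OpenMPI detects the Slurm allocation, so no explicit hostfile is
# needed; communication between nodes uses the InfiniBand fabric.
mpirun ./distributed_train          # hypothetical MPI program
```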
Access and Support
Access to the "AI in the Andes" cluster is granted through a user account and SSH key authentication. Detailed instructions can be found in the Account Creation Guide. For technical support, please submit a ticket through the Support Portal. We also maintain a Frequently Asked Questions page that addresses common issues.
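In outline (the key type and login hostname below are illustrative; the Account Creation Guide gives the actual hostname and key-registration procedure):

```shell
# Generate a key pair on your local machine if you do not already have one.
ssh-keygen -t ed25519 -C "your_email@example.com"

# Register the public key (~/.ssh/id_ed25519.pub) as described in the
# Account Creation Guide, then connect to a login node:
ssh username@andes-login.example.org   # hypothetical hostname
```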
Intel-Based Server Configurations
Configuration | Specifications | Benchmark |
---|---|---|
Core i7-6700K/7700 Server | 64 GB DDR4, 2 x 512 GB NVMe SSD | CPU Benchmark: 8046 |
Core i7-8700 Server | 64 GB DDR4, 2 x 1 TB NVMe SSD | CPU Benchmark: 13124 |
Core i9-9900K Server | 128 GB DDR4, 2 x 1 TB NVMe SSD | CPU Benchmark: 49969 |
Core i9-13900 Server (64GB) | 64 GB RAM, 2 x 2 TB NVMe SSD | n/a |
Core i9-13900 Server (128GB) | 128 GB RAM, 2 x 2 TB NVMe SSD | n/a |
Core i5-13500 Server (64GB) | 64 GB RAM, 2 x 500 GB NVMe SSD | n/a |
Core i5-13500 Server (128GB) | 128 GB RAM, 2 x 500 GB NVMe SSD | n/a |
Core i5-13500 Workstation | 64 GB DDR5 RAM, 2 x NVMe SSD, NVIDIA RTX 4000 | n/a |
AMD-Based Server Configurations
Configuration | Specifications | Benchmark |
---|---|---|
Ryzen 5 3600 Server | 64 GB RAM, 2 x 480 GB NVMe SSD | CPU Benchmark: 17849 |
Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2 x 1 TB NVMe SSD | CPU Benchmark: 35224 |
Ryzen 9 5950X Server | 128 GB RAM, 2 x 4 TB NVMe SSD | CPU Benchmark: 46045 |
Ryzen 9 7950X Server | 128 GB DDR5 ECC RAM, 2 x 2 TB NVMe SSD | CPU Benchmark: 63561 |
EPYC 7502P Server (128GB/1TB) | 128 GB RAM, 1 TB NVMe SSD | CPU Benchmark: 48021 |
EPYC 7502P Server (128GB/2TB) | 128 GB RAM, 2 TB NVMe SSD | CPU Benchmark: 48021 |
EPYC 7502P Server (128GB/4TB) | 128 GB RAM, 2 x 2 TB NVMe SSD | CPU Benchmark: 48021 |
EPYC 7502P Server (256GB/1TB) | 256 GB RAM, 1 TB NVMe SSD | CPU Benchmark: 48021 |
EPYC 7502P Server (256GB/4TB) | 256 GB RAM, 2 x 2 TB NVMe SSD | CPU Benchmark: 48021 |
EPYC 9454P Server | 256 GB RAM, 2 x 2 TB NVMe SSD | n/a |
*Note: All benchmark scores are approximate and may vary based on configuration.*