AI in the Andes


AI in the Andes: Server Configuration Documentation

Welcome to the documentation for "AI in the Andes," our high-performance server cluster dedicated to Artificial Intelligence and Machine Learning workloads. This article aims to provide a comprehensive overview of the server configuration for newcomers to the system. Understanding this setup is crucial for effective job submission, resource allocation, and troubleshooting. This documentation focuses on the core hardware and software components.

Overview

"AI in the Andes" is a distributed computing environment built to accelerate research in areas like deep learning, natural language processing, and computer vision. The cluster comprises interconnected servers, each optimized for computational intensity. The system utilizes a hybrid architecture, combining CPU and GPU resources for optimal performance across a wide range of AI tasks. We utilize Slurm for workload management and Anaconda for environment control. This document details the specifics of each server node within the cluster. It is highly recommended to review our Job Submission Guide before putting workloads onto the system. Understanding System Limits is also vital. Please also read our Security Policy before accessing the system.

Hardware Specifications

The cluster consists of 10 identical server nodes, designated 'andes01' through 'andes10'. Each node is built with the following specifications:

Component         | Specification
CPU               | 2 x Intel Xeon Gold 6338 (32 cores / 64 threads per CPU)
RAM               | 512 GB DDR4 ECC Registered @ 3200 MHz
GPU               | 4 x NVIDIA A100 80GB PCIe 4.0
Storage (local)   | 2 x 4 TB NVMe PCIe 4.0 SSD (RAID 0)
Network interface | 2 x 200 Gbps InfiniBand
Power supply      | 2 x 2000 W Platinum redundant power supplies

These specifications provide substantial processing power and memory capacity for demanding AI applications. The high-speed networking ensures efficient data transfer between nodes for distributed training. Refer to the Networking Guide for more details on InfiniBand configuration. The local storage is primarily for temporary files and caching; long-term storage resides on our Network File System.
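To confirm that a node exposes the hardware listed above, a quick GPU inventory check can be run on the node itself. The sketch below assumes only that the nvidia-smi utility shipped with the NVIDIA driver is on the PATH; the expectation of four A100 80GB devices follows from the table above.

"""Minimal sketch: list the GPUs visible on the current compute node."""
import subprocess

def list_gpus() -> list[str]:
    """Return one 'name, memory' line per visible GPU, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()

if __name__ == "__main__":
    gpus = list_gpus()
    print(f"{len(gpus)} GPU(s) visible")   # expect 4 x A100 80GB per node
    for line in gpus:
        print(" ", line)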

Software Environment

Each server node runs a consistent software stack. This ensures compatibility and simplifies deployment of AI frameworks.

Software         | Version
Operating system | CentOS Linux 7.9
CUDA Toolkit     | 11.8
cuDNN            | 8.6.0
NVIDIA driver    | 525.85.05
Python           | 3.9
Anaconda         | 2023.03
Slurm            | 22.05
MPI              | OpenMPI 4.1.4

All users are encouraged to use the Anaconda environment manager to create isolated environments for their projects. This prevents dependency conflicts and ensures reproducibility. See our Anaconda Tutorial for detailed instructions. We also provide pre-configured environments for popular frameworks such as TensorFlow, PyTorch, and Keras. The Software Catalog lists all available software.
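The following sketch shows one way to confirm, from inside an activated environment, that the CUDA/cuDNN stack listed above is picked up correctly. It assumes PyTorch is installed in the active environment (for example, one of the pre-configured framework environments); the exact versions available to you are listed in the Software Catalog.

"""Minimal sketch: report the CUDA/cuDNN stack seen by PyTorch."""
import torch

print("PyTorch      :", torch.__version__)
print("CUDA (build) :", torch.version.cuda)               # expected 11.8 per the table above
print("cuDNN        :", torch.backends.cudnn.version())   # expected 8600 for cuDNN 8.6.0
print("GPUs visible :", torch.cuda.device_count())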

Storage System

The "AI in the Andes" cluster utilizes a shared network file system (NFS) for persistent storage. This allows users to access their data from any node in the cluster.

Mount point      | Capacity      | Purpose
/home/[username] | 10 TB         | User home directory (personal files, scripts)
/data/shared     | 100 TB        | Shared data storage (datasets, project files)
/tmp             | 1 TB per node | Temporary storage (automatically purged)

Access to the `/data/shared` directory requires appropriate permissions, which can be requested through the Support Portal. We strongly recommend storing all persistent data on the NFS file system rather than on the local SSDs. The Data Backup Policy details our data protection procedures.
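A common pattern that follows from this layout is to keep persistent data on the NFS mounts and stage a working copy to node-local /tmp for the duration of a job, since the local NVMe drives are much faster for repeated random reads. The sketch below illustrates that pattern; the dataset path /data/shared/example_dataset and the job name are hypothetical placeholders.

"""Minimal sketch: stage a dataset from shared NFS storage to node-local scratch."""
import getpass
import shutil
from pathlib import Path

def stage_to_local(shared_path: Path, job_name: str) -> Path:
    """Copy a dataset from NFS to /tmp on the current node and return the local path."""
    scratch = Path("/tmp") / getpass.getuser() / job_name
    scratch.mkdir(parents=True, exist_ok=True)
    local_copy = scratch / shared_path.name
    if not local_copy.exists():
        shutil.copytree(shared_path, local_copy)   # skip the copy if already staged
    return local_copy

if __name__ == "__main__":
    data = stage_to_local(Path("/data/shared/example_dataset"), "demo-training")
    print("Training data staged at:", data)
    # Remember: /tmp is purged automatically; write results back to /home or /data/shared.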

Networking Configuration

The cluster's high-performance networking is based on InfiniBand. This provides low-latency, high-bandwidth communication between nodes, crucial for distributed training. Each node has two 200Gbps InfiniBand ports connected to a Fat Tree topology. The network is managed by a dedicated InfiniBand switch. Users should be aware of the InfiniBand Best Practices to optimize network performance. Firewall rules are managed by the System Administrators and are not directly configurable by users.
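For multi-node jobs, frameworks such as PyTorch reach the InfiniBand fabric through the NCCL communication backend. The sketch below shows a minimal initialization intended to be launched under Slurm (for example with srun); it assumes PyTorch is available in the environment and that MASTER_ADDR and MASTER_PORT are set in the job environment, and it is not a mandated launch procedure for this cluster.

"""Minimal sketch: join a multi-node NCCL process group under Slurm."""
import os
import torch
import torch.distributed as dist

def init_distributed() -> None:
    """Derive rank and world size from Slurm's environment and initialize NCCL."""
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    torch.cuda.set_device(local_rank)     # one process per GPU
    dist.init_process_group(
        backend="nccl",                   # NCCL uses the InfiniBand fabric when available
        init_method="env://",             # reads MASTER_ADDR / MASTER_PORT from the environment
        rank=rank,
        world_size=world_size,
    )

if __name__ == "__main__":
    init_distributed()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} ready")
    dist.destroy_process_group()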

Access and Support

Access to the "AI in the Andes" cluster is granted through a user account and SSH key authentication. Detailed instructions can be found in the Account Creation Guide. For technical support, please submit a ticket through the Support Portal. We also maintain a Frequently Asked Questions page that addresses common issues.

