
# AI Platforms: Server Configuration Guide

This article details the server configuration for our dedicated AI Platforms, designed to support machine learning workloads. This guide is intended for new system administrators and engineers getting acquainted with the environment.

## Overview

The AI Platforms are built on a cluster of high-performance servers specifically configured for demanding AI/ML tasks. These platforms support a variety of frameworks including TensorFlow, PyTorch, and scikit-learn. The core infrastructure is designed for scalability and resilience, leveraging redundant components and automated failover mechanisms. Access to these platforms is managed through User Account Control and requires specific permissions granted by the Security Team.

## Hardware Specifications

The current generation of AI Platform servers uses a standardized hardware configuration to simplify management and ensure consistent performance. Details are outlined below.

| Component | Specification | Quantity per Server |
|---|---|---|
| CPU | Intel Xeon Gold 6338 (32 cores / 64 threads each) | 2 |
| RAM | 512 GB DDR4 ECC Registered, 3200 MHz | 1 |
| GPU | NVIDIA A100 80 GB (PCIe 4.0) | 8 |
| Storage (OS) | 500 GB NVMe SSD | 1 |
| Storage (Data) | 8 TB NVMe SSD (RAID 0) | 1 |
| Network Interface | 2 × 100 GbE Mellanox ConnectX-6 | 1 |
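To put the table in perspective, the per-server totals can be derived directly from the figures above. This is a minimal illustrative sketch; the function and parameter names are hypothetical, not part of any platform tooling.

```python
# Hypothetical sketch: headline per-server totals implied by the
# hardware table. All input figures come from the table itself.

def server_totals(cpus=2, cores_per_cpu=32, threads_per_core=2,
                  gpus=8, gpu_mem_gb=80, data_disk_tb=8):
    """Return aggregate compute and memory figures for one server."""
    return {
        "cpu_cores": cpus * cores_per_cpu,                        # physical cores
        "cpu_threads": cpus * cores_per_cpu * threads_per_core,   # hardware threads
        "gpu_memory_gb": gpus * gpu_mem_gb,                       # total HBM across all GPUs
        "data_capacity_tb": data_disk_tb,                         # RAID 0: raw capacity = usable, no redundancy
    }

print(server_totals())
# → {'cpu_cores': 64, 'cpu_threads': 128, 'gpu_memory_gb': 640, 'data_capacity_tb': 8}
```

Note that RAID 0 stripes data across disks for throughput but provides no redundancy; a single drive failure loses the array.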

The servers are housed in a dedicated, climate-controlled data center with redundant power supplies and network connectivity. Detailed rack diagrams are available on the Data Center Documentation page.

## Software Stack

The AI Platforms run a customized Linux distribution based on Ubuntu Server 20.04 LTS. This distribution is hardened for security and optimized for machine learning workloads. Key software components include the supported frameworks noted above:

- TensorFlow
- PyTorch
- scikit-learn
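As a quick sanity check on a platform node, the frameworks named above can be probed without importing them. This is a hedged sketch, not official platform tooling; the helper name is illustrative.

```python
# Hypothetical sketch: check which of the supported frameworks are
# installed. find_spec() returns None for a missing top-level package,
# so absent frameworks are reported rather than raising ImportError.
from importlib.util import find_spec

# PyTorch installs as "torch" and scikit-learn as "sklearn".
FRAMEWORKS = ["tensorflow", "torch", "sklearn"]

def check_frameworks(names=FRAMEWORKS):
    """Map each package name to True if importable, else False."""
    return {name: find_spec(name) is not None for name in names}

print(check_frameworks())
```

Running this on a freshly provisioned node confirms the stack before users are granted access.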
