
# AI Platforms: Server Configuration Guide

This article details the server configuration for our dedicated AI Platforms, designed to support machine learning workloads. This guide is intended for new system administrators and engineers getting acquainted with the environment.

## Overview

The AI Platforms are built on a cluster of high-performance servers specifically configured for demanding AI/ML tasks. These platforms support a variety of frameworks including TensorFlow, PyTorch, and scikit-learn. The core infrastructure is designed for scalability and resilience, leveraging redundant components and automated failover mechanisms. Access to these platforms is managed through User Account Control and requires specific permissions granted by the Security Team.

## Hardware Specifications

The current generation of AI Platform servers uses a standardized hardware configuration to simplify management and ensure consistent performance. Details are outlined below.

| Component | Specification | Quantity per Server |
|---|---|---|
| CPU | Intel Xeon Gold 6338 (32 cores / 64 threads each) | 2 |
| RAM | 512 GB DDR4 ECC Registered, 3200 MHz | 1 |
| GPU | NVIDIA A100 80 GB (PCIe 4.0) | 8 |
| Storage (OS) | 500 GB NVMe SSD | 1 |
| Storage (Data) | 8 TB NVMe SSD (RAID 0) | 1 |
| Network Interface | 2 × 100 GbE Mellanox ConnectX-6 | 1 |
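To put the table in perspective, the per-server totals can be derived directly from the figures above. This is a minimal illustrative sketch; the function and parameter names are hypothetical, not part of any platform tooling.

```python
# Hypothetical sketch: headline per-server totals implied by the
# hardware table. All input figures come from the table itself.

def server_totals(cpus=2, cores_per_cpu=32, threads_per_core=2,
                  gpus=8, gpu_mem_gb=80, data_disk_tb=8):
    """Return aggregate compute and memory figures for one server."""
    return {
        "cpu_cores": cpus * cores_per_cpu,                        # physical cores
        "cpu_threads": cpus * cores_per_cpu * threads_per_core,   # hardware threads
        "gpu_memory_gb": gpus * gpu_mem_gb,                       # total HBM across all GPUs
        "data_capacity_tb": data_disk_tb,                         # RAID 0: raw capacity = usable, no redundancy
    }

print(server_totals())
# → {'cpu_cores': 64, 'cpu_threads': 128, 'gpu_memory_gb': 640, 'data_capacity_tb': 8}
```

Note that RAID 0 stripes data across disks for throughput but provides no redundancy; a single drive failure loses the array.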

The servers are housed in a dedicated, climate-controlled data center with redundant power supplies and network connectivity. Detailed rack diagrams are available on the Data Center Documentation page.

## Software Stack

The AI Platforms run a customized Linux distribution based on Ubuntu Server 20.04 LTS. This distribution is hardened for security and optimized for machine learning workloads. Key software components include the supported frameworks noted above:

- TensorFlow
- PyTorch
- scikit-learn
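As a quick sanity check on a platform node, the frameworks named above can be probed without importing them. This is a hedged sketch, not official platform tooling; the helper name is illustrative.

```python
# Hypothetical sketch: check which of the supported frameworks are
# installed. find_spec() returns None for a missing top-level package,
# so absent frameworks are reported rather than raising ImportError.
from importlib.util import find_spec

# PyTorch installs as "torch" and scikit-learn as "sklearn".
FRAMEWORKS = ["tensorflow", "torch", "sklearn"]

def check_frameworks(names=FRAMEWORKS):
    """Map each package name to True if importable, else False."""
    return {name: find_spec(name) is not None for name in names}

print(check_frameworks())
```

Running this on a freshly provisioned node confirms the stack before users are granted access.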
