AI Platforms
- AI Platforms: Server Configuration Guide
This article details the server configuration for our dedicated AI Platforms, designed to support machine learning workloads. This guide is intended for new system administrators and engineers getting acquainted with the environment.
Overview
The AI Platforms are built on a cluster of high-performance servers specifically configured for demanding AI/ML tasks. These platforms support a variety of frameworks including TensorFlow, PyTorch, and scikit-learn. The core infrastructure is designed for scalability and resilience, leveraging redundant components and automated failover mechanisms. Access to these platforms is managed through User Account Control and requires specific permissions granted by the Security Team.
Hardware Specifications
The current generation of AI Platform servers utilize a standardized hardware configuration to simplify management and ensure consistent performance. Details are outlined below.
Component | Specification | Quantity per Server |
---|---|---|
CPU | Dual Intel Xeon Gold 6338 (32 cores/64 threads per CPU) | 2 |
RAM | 512 GB DDR4 ECC Registered 3200 MHz | 1 |
GPU | NVIDIA A100 80GB PCIe 4.0 | 8 |
Storage (OS) | 500 GB NVMe SSD | 1 |
Storage (Data) | 8 TB NVMe SSD (RAID 0) | 1 |
Network Interface | 2 x 100 GbE Mellanox ConnectX-6 | 1 |
The servers are housed in a dedicated, climate-controlled data center with redundant power supplies and network connectivity. Detailed rack diagrams are available on the Data Center Documentation page.
Software Stack
The AI Platforms run a customized Linux distribution based on Ubuntu Server 20.04 LTS. This distribution is hardened for security and optimized for machine learning workloads. Key software components include:
- CUDA Toolkit: Version 11.8 for GPU acceleration.
- cuDNN: Version 8.6 for deep neural networks.
- NCCL: Version 2.14 for multi-GPU communication.
- Docker: Version 20.10 for containerization. See Docker Usage Guidelines for more information.
- Kubernetes: Version 1.23 for container orchestration. Managed via Kubernetes Dashboard.
- NFS: Network File System for shared storage. Configuration details are found in the Network Storage Documentation.
Network Configuration
The AI Platforms are connected to the internal network via a dedicated VLAN. This separation enhances security and improves performance.
Parameter | Value |
---|---|
VLAN ID | 1001 |
Subnet Mask | 255.255.255.0 |
Gateway | 192.168.1001.1 |
DNS Servers | 192.168.1.10, 192.168.1.11 |
All communication between servers within the cluster utilizes this VLAN. External access is restricted and controlled through a dedicated Firewall Configuration. Inter-server communication is optimized using RDMA over Converged Ethernet (RoCE). Further details on network troubleshooting can be found in the Network Troubleshooting Guide.
Storage Configuration
Data storage is provided by a combination of local NVMe SSDs and a centralized Network File System (NFS). The local SSDs are used for fast access to frequently used data, while the NFS provides a shared repository for larger datasets.
Storage Type | Capacity | Protocol | Purpose |
---|---|---|---|
Local NVMe SSD | 8 TB | PCIe 4.0 | Model Training Data, Temporary Files |
NFS Share | 1 PB | NFSv4.1 | Large Datasets, Model Artifacts |
Regular backups of the NFS share are performed nightly and stored in a geographically separate location. Backup and restore procedures are documented in the Disaster Recovery Plan. Access permissions to the NFS share are managed through Access Control Lists.
Monitoring and Logging
Comprehensive monitoring and logging are essential for maintaining the health and performance of the AI Platforms. We utilize the following tools:
- Prometheus: For collecting and storing metrics. See Prometheus Monitoring.
- Grafana: For visualizing metrics. Access Grafana dashboards via Grafana Access.
- ELK Stack (Elasticsearch, Logstash, Kibana): For centralized logging. Documentation is available on the Logging Infrastructure page.
- Nagios: For system monitoring and alerting. Details on alert configuration are in the Nagios Configuration Guide.
Security Considerations
Security is paramount. The AI Platforms are subject to regular security audits and vulnerability assessments. Key security measures include:
- Two-factor authentication for all user accounts.
- Strict firewall rules.
- Intrusion detection and prevention systems.
- Regular software updates and patching.
- Data encryption at rest and in transit.
Please refer to the Security Policy for detailed information.
Intel-Based Server Configurations
Configuration | Specifications | Benchmark |
---|---|---|
Core i7-6700K/7700 Server | 64 GB DDR4, NVMe SSD 2 x 512 GB | CPU Benchmark: 8046 |
Core i7-8700 Server | 64 GB DDR4, NVMe SSD 2x1 TB | CPU Benchmark: 13124 |
Core i9-9900K Server | 128 GB DDR4, NVMe SSD 2 x 1 TB | CPU Benchmark: 49969 |
Core i9-13900 Server (64GB) | 64 GB RAM, 2x2 TB NVMe SSD | |
Core i9-13900 Server (128GB) | 128 GB RAM, 2x2 TB NVMe SSD | |
Core i5-13500 Server (64GB) | 64 GB RAM, 2x500 GB NVMe SSD | |
Core i5-13500 Server (128GB) | 128 GB RAM, 2x500 GB NVMe SSD | |
Core i5-13500 Workstation | 64 GB DDR5 RAM, 2 NVMe SSD, NVIDIA RTX 4000 |
AMD-Based Server Configurations
Configuration | Specifications | Benchmark |
---|---|---|
Ryzen 5 3600 Server | 64 GB RAM, 2x480 GB NVMe | CPU Benchmark: 17849 |
Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2x1 TB NVMe | CPU Benchmark: 35224 |
Ryzen 9 5950X Server | 128 GB RAM, 2x4 TB NVMe | CPU Benchmark: 46045 |
Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2x2 TB NVMe | CPU Benchmark: 63561 |
EPYC 7502P Server (128GB/1TB) | 128 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (128GB/2TB) | 128 GB RAM, 2 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (128GB/4TB) | 128 GB RAM, 2x2 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (256GB/1TB) | 256 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (256GB/4TB) | 256 GB RAM, 2x2 TB NVMe | CPU Benchmark: 48021 |
EPYC 9454P Server | 256 GB RAM, 2x2 TB NVMe |
Order Your Dedicated Server
Configure and order your ideal server configuration
Need Assistance?
- Telegram: @powervps Servers at a discounted price
⚠️ *Note: All benchmark scores are approximate and may vary based on configuration. Server availability subject to stock.* ⚠️