AI in Durham: Server Configuration
This document details the server configuration supporting the "AI in Durham" project. It is intended for new system administrators and developers contributing to the project's infrastructure. This project focuses on providing computational resources for local Artificial Intelligence research and development. It utilizes a hybrid cloud approach, leveraging both on-premise hardware and cloud services.
Overview
The "AI in Durham" project requires a robust and scalable server infrastructure to handle demanding workloads related to machine learning, deep learning, and data science. This infrastructure is designed for high throughput, low latency, and data security. We utilize a combination of GPU servers for training, CPU servers for inference and general processing, and a distributed file system for data storage. Cloud resources are used for burst capacity and specialized services. See Server Infrastructure Overview for a more general description of our server environments.
Hardware Specifications
The core on-premise infrastructure consists of three primary server types: GPU servers, CPU servers, and storage servers.
GPU Servers
These servers are dedicated to computationally intensive tasks like model training.
| Specification | Value |
|---|---|
| Model | Dell PowerEdge R750xa |
| CPU | 2 x Intel Xeon Gold 6348 (28 cores per CPU) |
| GPU | 4 x NVIDIA A100 (80 GB HBM2e) |
| RAM | 512 GB DDR4 ECC REG |
| Storage | 4 x 4 TB NVMe PCIe Gen4 SSD (RAID 0) |
| Network | 2 x 100GbE ConnectX-6 |
| Operating System | Ubuntu 22.04 LTS |
These servers run CUDA Toolkit 12.2 and cuDNN 8.9.2 and are optimized for deep learning frameworks such as TensorFlow and PyTorch. See GPU Server Maintenance for information on monitoring and upkeep.
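After a driver or toolkit upgrade, the versions above can be sanity-checked from the shell. This is a sketch; exact output formats vary by driver and framework release:

```shell
# Report GPU model and driver version for each A100
nvidia-smi --query-gpu=name,driver_version --format=csv
# Report the installed CUDA Toolkit release (expect 12.2)
nvcc --version
# Confirm PyTorch sees the GPUs and which CUDA build it was compiled against
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```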
CPU Servers
These servers handle inference, data pre-processing, and general-purpose computing tasks.
| Specification | Value |
|---|---|
| Model | Supermicro SuperServer 2029U-TR4 |
| CPU | 2 x AMD EPYC 7763 (64 cores per CPU) |
| RAM | 1 TB DDR4 ECC REG |
| Storage | 8 x 8 TB SATA HDD (RAID 6) + 2 x 1 TB NVMe SSD (OS) |
| Network | 2 x 25GbE |
| Operating System | CentOS Stream 9 |
These servers utilize Docker containers for application isolation and portability. For detailed information on containerization practices, see Containerization Best Practices.
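A minimal Dockerfile for an inference service on these machines might look like the following. This is a sketch only: the base image, file names, and port are illustrative, not the project's actual service.

```dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install pinned dependencies first so this layer caches across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application code last
COPY . .
EXPOSE 8000
CMD ["python", "serve.py"]
```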
Storage Servers
These servers provide centralized storage for datasets and model artifacts.
| Specification | Value |
|---|---|
| Model | NetApp FAS2750 |
| Storage Capacity | 368 TB raw (usable capacity varies with RAID configuration) |
| RAID Level | RAID-6 |
| Network | 4 x 40GbE |
| File System | WAFL (NetApp ONTAP) |
Storage is accessed via NFS and SMB protocols. Refer to the Data Storage Policy for details on data backup and recovery procedures.
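The usable share of the 368 TB raw pool depends on the aggregate layout. As a rough illustration, assuming equal-size disks and ignoring ONTAP overhead and spares (the 23 x 16 TB layout below is hypothetical), RAID-6 reserves two disks' worth of capacity for parity:

```python
def raid6_usable_tb(disk_count: int, disk_size_tb: float) -> float:
    """RAID-6 stores two disks' worth of parity, so usable = (N - 2) * size."""
    if disk_count < 4:
        raise ValueError("RAID-6 requires at least 4 disks")
    return (disk_count - 2) * disk_size_tb

# Hypothetical layout: 23 x 16 TB disks = 368 TB raw
raw_tb = 23 * 16                       # 368 TB raw
usable_tb = raid6_usable_tb(23, 16)    # 336 TB before filesystem overhead
```

Real usable capacity will be lower still once WAFL metadata, snapshot reserves, and hot spares are accounted for.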
Software Stack
The software stack is designed for flexibility and scalability. Key components include:
- Operating Systems: Ubuntu 22.04 LTS (GPU Servers), CentOS Stream 9 (CPU Servers)
- Containerization: Docker and Kubernetes
- Machine Learning Frameworks: TensorFlow, PyTorch, Scikit-learn
- Data Science Tools: Jupyter Notebook, RStudio
- Monitoring: Prometheus, Grafana
- Version Control: Git and GitHub
- Configuration Management: Ansible
- Networking: SSH, VPN
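Because the fleet mixes Ubuntu and CentOS Stream, routine patching is a natural fit for Ansible. A playbook sketch follows; the inventory group names `gpu_servers` and `cpu_servers` are assumptions, not the project's actual inventory:

```yaml
- name: Apply package updates across the fleet
  hosts: gpu_servers:cpu_servers
  become: true
  tasks:
    - name: Update apt packages (Ubuntu GPU servers)
      ansible.builtin.apt:
        upgrade: safe
        update_cache: true
      when: ansible_facts['os_family'] == 'Debian'

    - name: Update dnf packages (CentOS Stream CPU servers)
      ansible.builtin.dnf:
        name: '*'
        state: latest
      when: ansible_facts['os_family'] == 'RedHat'
```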
Network Configuration
The server infrastructure is connected via a dedicated 100GbE backbone network. A separate 1GbE network provides access for administrative tasks and general use. Firewall rules are configured to restrict access to essential services only. See Network Security Protocol for details. DNS is managed internally using BIND9.
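On the Ubuntu GPU servers, "essential services only" might translate into ufw rules like these. This is a sketch: the 10.10.0.0/24 admin subnet and the node-exporter port are assumptions, not the project's actual values.

```shell
ufw default deny incoming
# SSH only from the 1GbE administrative network
ufw allow from 10.10.0.0/24 to any port 22 proto tcp
# Prometheus node exporter, scraped from the monitoring host
ufw allow from 10.10.0.0/24 to any port 9100 proto tcp
ufw enable
```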
Cloud Integration
We utilize Amazon Web Services (AWS) for burst capacity and specialized services such as:
- AWS S3: For long-term data storage and archiving.
- AWS EC2: For on-demand GPU instances during peak training periods.
- AWS SageMaker: For managed machine learning services.
Communication between on-premise servers and AWS is secured via AWS VPN. For information on cloud cost management, see Cloud Cost Optimization.
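For example, archiving a finished training run to S3 can be done with the AWS CLI over the VPN. The bucket name and paths below are placeholders:

```shell
aws s3 cp ./checkpoints/run-042/ s3://example-ai-durham-archive/run-042/ \
    --recursive --storage-class GLACIER
```

For data that is almost never retrieved, the DEEP_ARCHIVE storage class is cheaper still, at the cost of longer restore times.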
Security Considerations
Security is paramount. All servers are regularly patched and monitored for vulnerabilities. Access control is enforced using strong authentication and authorization mechanisms. Data is encrypted both in transit and at rest. See Security Best Practices for comprehensive guidelines. The Incident Response Plan details procedures for handling security breaches.
Intel-Based Server Configurations
| Configuration | Specifications | Benchmark |
|---|---|---|
| Core i7-6700K/7700 Server | 64 GB DDR4, 2 x 512 GB NVMe SSD | CPU Benchmark: 8046 |
| Core i7-8700 Server | 64 GB DDR4, 2 x 1 TB NVMe SSD | CPU Benchmark: 13124 |
| Core i9-9900K Server | 128 GB DDR4, 2 x 1 TB NVMe SSD | CPU Benchmark: 49969 |
| Core i9-13900 Server (64GB) | 64 GB RAM, 2 x 2 TB NVMe SSD | |
| Core i9-13900 Server (128GB) | 128 GB RAM, 2 x 2 TB NVMe SSD | |
| Core i5-13500 Server (64GB) | 64 GB RAM, 2 x 500 GB NVMe SSD | |
| Core i5-13500 Server (128GB) | 128 GB RAM, 2 x 500 GB NVMe SSD | |
| Core i5-13500 Workstation | 64 GB DDR5 RAM, 2 x NVMe SSD, NVIDIA RTX 4000 | |
AMD-Based Server Configurations
| Configuration | Specifications | Benchmark |
|---|---|---|
| Ryzen 5 3600 Server | 64 GB RAM, 2 x 480 GB NVMe | CPU Benchmark: 17849 |
| Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2 x 1 TB NVMe | CPU Benchmark: 35224 |
| Ryzen 9 5950X Server | 128 GB RAM, 2 x 4 TB NVMe | CPU Benchmark: 46045 |
| Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2 x 2 TB NVMe | CPU Benchmark: 63561 |
| EPYC 7502P Server (128GB/1TB) | 128 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (128GB/2TB) | 128 GB RAM, 2 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (128GB/4TB) | 128 GB RAM, 2 x 2 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (256GB/1TB) | 256 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (256GB/4TB) | 256 GB RAM, 2 x 2 TB NVMe | CPU Benchmark: 48021 |
| EPYC 9454P Server | 256 GB RAM, 2 x 2 TB NVMe | |
*Note: Benchmark scores are approximate and may vary with configuration.*