AI in Durham

AI in Durham: Server Configuration

This document details the server configuration supporting the "AI in Durham" project. It is intended for new system administrators and developers contributing to the project's infrastructure. This project focuses on providing computational resources for local Artificial Intelligence research and development. It utilizes a hybrid cloud approach, leveraging both on-premise hardware and cloud services.

Overview

The "AI in Durham" project requires a robust and scalable server infrastructure to handle demanding workloads related to machine learning, deep learning, and data science. This infrastructure is designed for high throughput, low latency, and data security. We utilize a combination of GPU servers for training, CPU servers for inference and general processing, and a distributed file system for data storage. Cloud resources are used for burst capacity and specialized services. See Server Infrastructure Overview for a more general description of our server environments.

Hardware Specifications

The core on-premise infrastructure consists of three primary server types: GPU servers, CPU servers, and storage servers.

GPU Servers

These servers are dedicated to computationally intensive tasks like model training.

Specification Value
Model Dell PowerEdge R750xa
CPU 2 x Intel Xeon Gold 6348 (28 cores per CPU)
GPU 4 x NVIDIA A100 (80GB HBM2e)
RAM 512 GB DDR4 ECC REG
Storage 4 x 4TB NVMe PCIe Gen4 SSD (RAID 0)
Network 2 x 100GbE ConnectX-6
Operating System Ubuntu 22.04 LTS

These servers run CUDA Toolkit 12.2 and cuDNN 8.9.2, optimized for deep learning frameworks like TensorFlow and PyTorch. See GPU Server Maintenance for information on monitoring and upkeep.
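
A quick way to confirm that a node's driver, CUDA runtime, and cuDNN are visible to the frameworks is to query them from PyTorch. The following is a minimal check, assuming PyTorch is installed in the active environment; the expected versions (CUDA 12.2, cuDNN 8.9.2) are those listed above.

    import torch

    # Versions the installed PyTorch build was compiled against
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("CUDA runtime:", torch.version.cuda)           # expected 12.x
    print("cuDNN:", torch.backends.cudnn.version())      # expected 8902 for 8.9.2

    # Enumerate the four A100 GPUs listed in the hardware table
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))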

CPU Servers

These servers handle inference, data pre-processing, and general-purpose computing tasks.

Specification Value
Model Supermicro Super Server 2029U-TR4
CPU 2 x AMD EPYC 7763 (64 cores per CPU)
RAM 1TB DDR4 ECC REG
Storage 8 x 8TB SATA HDD (RAID 6) + 2 x 1TB NVMe SSD (OS)
Network 2 x 25GbE
Operating System CentOS Stream 9

These servers utilize Docker containers for application isolation and portability. For detailed information on containerization practices, see Containerization Best Practices.
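
As an illustration of how an inference job can be launched in an isolated container, the sketch below uses the Docker SDK for Python. The image name and mounted dataset path are hypothetical placeholders rather than part of the project's actual configuration; see Containerization Best Practices for the real conventions.

    import docker

    client = docker.from_env()  # connect to the local Docker daemon

    # Run a hypothetical inference image with the dataset mounted read-only
    output = client.containers.run(
        image="inference-service:latest",        # placeholder image name
        command="python run_inference.py",
        volumes={"/data/datasets": {"bind": "/data", "mode": "ro"}},
        remove=True,                             # clean up the container afterwards
    )
    print(output.decode())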

Storage Servers

These servers provide centralized storage for datasets and model artifacts.

Specification Value
Model NetApp FAS2750
Storage Capacity 368 TB Raw (Usable varies with RAID configuration)
RAID Level RAID-6
Network 4 x 40GbE
Operating System NetApp ONTAP

Storage is accessed via NFS and SMB protocols. Refer to the Data Storage Policy for details on data backup and recovery procedures.
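
Before a job starts, it is worth confirming that the shared export is actually mounted on the node rather than writing into a local directory. The check below is a minimal sketch; the mount point /mnt/datasets is an assumed path, not one defined by the Data Storage Policy.

    import os

    MOUNT_POINT = "/mnt/datasets"  # assumed NFS mount point for shared datasets

    # os.path.ismount() is True only for a real mount, not an ordinary directory
    if not os.path.ismount(MOUNT_POINT):
        raise RuntimeError(f"{MOUNT_POINT} is not mounted; check the NFS export")

    print("Entries visible on the share:", len(os.listdir(MOUNT_POINT)))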

Software Stack

The software stack is designed for flexibility and scalability. Key components, each described in the relevant sections of this document, include:

  • Ubuntu 22.04 LTS (GPU servers) and CentOS Stream 9 (CPU servers) as operating systems.
  • CUDA Toolkit 12.2 and cuDNN 8.9.2 for GPU acceleration.
  • TensorFlow and PyTorch as the primary deep learning frameworks.
  • Docker for application isolation and portability.
  • BIND9 for internal DNS.

Network Configuration

The server infrastructure is connected via a dedicated 100GbE backbone network. A separate 1GbE network provides access for administrative tasks and general use. Firewall rules are configured to restrict access to essential services only. See Network Security Protocol for details. DNS is managed internally using BIND9.
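
Administrators can sanity-check that internal DNS resolves a host and that a permitted service port is reachable through the firewall using only the Python standard library. The hostname and port below are hypothetical examples, not entries from the actual zone files.

    import socket

    HOST = "gpu01.ai-durham.internal"  # hypothetical internal hostname
    PORT = 22                          # SSH, assumed to be a permitted service

    # Resolution goes through the locally configured resolver (the internal BIND9 servers)
    addr = socket.getaddrinfo(HOST, PORT, proto=socket.IPPROTO_TCP)[0][4][0]
    print(f"{HOST} resolves to {addr}")

    # A successful TCP connection confirms the firewall permits the port
    with socket.create_connection((addr, PORT), timeout=5):
        print(f"Port {PORT} on {HOST} is reachable")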

Cloud Integration

We utilize Amazon Web Services (AWS) for burst capacity and specialized services such as:

  • AWS S3: For long-term data storage and archiving.
  • AWS EC2: For on-demand GPU instances during peak training periods.
  • AWS SageMaker: For managed machine learning services.

Communication between on-premise servers and AWS is secured via AWS VPN. For information on cloud cost management, see Cloud Cost Optimization.
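
As a sketch of the S3 archiving workflow mentioned above, the snippet below uploads a model artifact with boto3. The bucket name, key, and local path are placeholders, and credentials are assumed to come from the standard AWS configuration on the host.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket and key; actual names are defined in the project's AWS account
    BUCKET = "ai-durham-archive"
    KEY = "models/example-model.pt"

    # Upload a local model artifact for long-term storage and archiving
    s3.upload_file(Filename="/mnt/models/example-model.pt", Bucket=BUCKET, Key=KEY)
    print(f"Uploaded to s3://{BUCKET}/{KEY}")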

Security Considerations

Security is paramount. All servers are regularly patched and monitored for vulnerabilities. Access control is enforced using strong authentication and authorization mechanisms. Data is encrypted both in transit and at rest. See Security Best Practices for comprehensive guidelines. The Incident Response Plan details procedures for handling security breaches.
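
To illustrate the at-rest half of that requirement, the sketch below encrypts an artifact with symmetric Fernet encryption from the cryptography package. This is only an example pattern; the project's actual encryption tooling and key management are described in Security Best Practices.

    from cryptography.fernet import Fernet

    # In practice the key would come from a managed secret store, not be generated ad hoc
    key = Fernet.generate_key()
    fernet = Fernet(key)

    plaintext = b"example model weights or dataset shard"
    ciphertext = fernet.encrypt(plaintext)

    # Decrypting with the same key restores the original bytes
    assert fernet.decrypt(ciphertext) == plaintext
    print("Encrypted", len(plaintext), "bytes into", len(ciphertext), "bytes")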


Intel-Based Server Configurations

Configuration Specifications Benchmark
Core i7-6700K/7700 Server 64 GB DDR4, NVMe SSD 2 x 512 GB CPU Benchmark: 8046
Core i7-8700 Server 64 GB DDR4, NVMe SSD 2x1 TB CPU Benchmark: 13124
Core i9-9900K Server 128 GB DDR4, NVMe SSD 2 x 1 TB CPU Benchmark: 49969
Core i9-13900 Server (64GB) 64 GB RAM, 2x2 TB NVMe SSD
Core i9-13900 Server (128GB) 128 GB RAM, 2x2 TB NVMe SSD
Core i5-13500 Server (64GB) 64 GB RAM, 2x500 GB NVMe SSD
Core i5-13500 Server (128GB) 128 GB RAM, 2x500 GB NVMe SSD
Core i5-13500 Workstation 64 GB DDR5 RAM, 2 NVMe SSD, NVIDIA RTX 4000

AMD-Based Server Configurations

Configuration Specifications Benchmark
Ryzen 5 3600 Server 64 GB RAM, 2x480 GB NVMe CPU Benchmark: 17849
Ryzen 7 7700 Server 64 GB DDR5 RAM, 2x1 TB NVMe CPU Benchmark: 35224
Ryzen 9 5950X Server 128 GB RAM, 2x4 TB NVMe CPU Benchmark: 46045
Ryzen 9 7950X Server 128 GB DDR5 ECC, 2x2 TB NVMe CPU Benchmark: 63561
EPYC 7502P Server (128GB/1TB) 128 GB RAM, 1 TB NVMe CPU Benchmark: 48021
EPYC 7502P Server (128GB/2TB) 128 GB RAM, 2 TB NVMe CPU Benchmark: 48021
EPYC 7502P Server (128GB/4TB) 128 GB RAM, 2x2 TB NVMe CPU Benchmark: 48021
EPYC 7502P Server (256GB/1TB) 256 GB RAM, 1 TB NVMe CPU Benchmark: 48021
EPYC 7502P Server (256GB/4TB) 256 GB RAM, 2x2 TB NVMe CPU Benchmark: 48021
EPYC 9454P Server 256 GB RAM, 2x2 TB NVMe

Note: All benchmark scores are approximate and may vary based on configuration. Server availability is subject to stock.