AI in Jersey: Server Configuration Documentation
This document describes the server configuration for the "AI in Jersey" project and serves as a guide for system administrators and developers. The project runs large language models (LLMs) for natural language processing tasks in the Jersey data center. The guide assumes a basic understanding of Linux server administration and networking concepts. Because it is aimed at newcomers to the wiki, the explanations are thorough.
Overview
The "AI in Jersey" infrastructure is built around a cluster of high-performance servers dedicated to model training and inference. The core components are GPU servers, CPU servers for pre- and post-processing, a high-bandwidth network interconnect, and a shared storage system. This setup handles large datasets and complex model architectures efficiently. We use a distributed computing framework to parallelize workloads across the cluster. See Distributed Computing for details.
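As a rough illustration of what parallelizing a workload across the cluster means, the sketch below splits a list of work items evenly between workers. The function name `shard_indices` and the worker and item counts are hypothetical. The source does not name the framework in use, so this shows only the general idea of static sharding, not the project's actual scheduler.

```python
def shard_indices(num_items: int, world_size: int, rank: int) -> range:
    """Return the contiguous block of item indices owned by one worker."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    base, extra = divmod(num_items, world_size)
    # The first `extra` workers each take one additional item,
    # so shard sizes differ by at most one.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


if __name__ == "__main__":
    # Hypothetical job: 10 input files spread over 4 workers.
    for rank in range(4):
        print(rank, list(shard_indices(10, 4, rank)))
```

Each worker calls the function with its own rank and processes only the indices it receives, so no coordination is needed once the item count and worker count are agreed on.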
Hardware Specifications
The following tables outline the hardware specifications for each server type within the cluster.
GPU Servers
These servers are the workhorses of the AI cluster, responsible for the computationally intensive tasks of model training and inference. Each server is equipped with multiple high-end GPUs.
Component | Specification |
---|---|
Server Model | Dell PowerEdge R760xa |
CPU | 2 x AMD EPYC 7763 (64-core) |
GPU | 8 x NVIDIA A100 80GB |
RAM | 512 GB DDR4 ECC REG |
Storage | 2 x 4TB NVMe SSD (RAID 1) |
Network Interface | 2 x 200Gbps InfiniBand |
Power Supply | Redundant 3000W Platinum |
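The per-server totals implied by the table can be worked out with a few multiplications. The sketch below does that and adds a weights-only estimate of whether a model fits in one server's combined GPU memory. The class and function names are hypothetical. The estimate assumes 2 bytes per parameter (16-bit weights) and ignores activations, optimizer state, and caches, so treat it as an upper bound on what fits rather than a sizing rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuServer:
    """Per-server figures copied from the GPU server table."""
    cpu_sockets: int = 2
    cores_per_cpu: int = 64
    gpus: int = 8
    gpu_memory_gb: int = 80
    nics: int = 2
    nic_gbps: int = 200

    @property
    def total_cores(self) -> int:
        return self.cpu_sockets * self.cores_per_cpu

    @property
    def total_gpu_memory_gb(self) -> int:
        return self.gpus * self.gpu_memory_gb

    @property
    def total_nic_gbps(self) -> int:
        return self.nics * self.nic_gbps


def weights_fit(server: GpuServer, params_billion: float,
                bytes_per_param: int = 2) -> bool:
    """Weights-only check: 1e9 parameters at N bytes each occupy N GB."""
    weights_gb = params_billion * bytes_per_param
    return weights_gb <= server.total_gpu_memory_gb


if __name__ == "__main__":
    server = GpuServer()
    print(server.total_cores, server.total_gpu_memory_gb, server.total_nic_gbps)
    print(weights_fit(server, params_billion=70))
```

With the table's values this gives 128 CPU cores, 640 GB of GPU memory, and 400 Gbps of InfiniBand bandwidth per server.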
CPU Servers
These servers handle data pre-processing and post-processing, and they orchestrate the overall workflow.
Component | Specification |
---|---|
Server Model | HPE ProLiant DL380 Gen10 |
CPU | 2 x Intel Xeon Gold 6338 (32-core) |
RAM | 256 GB DDR4 ECC REG |
Storage | 4 x 8TB SAS HDD (RAID 5) |
Network Interface | 2 x 100Gbps Ethernet |
Power Supply | Redundant 800W Platinum |
Storage Server
This server provides shared storage for the entire cluster, accessible via a high-speed network.
Component | Specification |
---|---|
Server Model | NetApp FAS2750 |
CPU | 2 x Intel Xeon Gold 6248R (24-core) |
RAM | 128 GB DDR4 ECC REG |
Storage | 16 x 18TB SAS HDD (RAID-DP), 288TB raw capacity |
Network Interface | 4 x 100Gbps Ethernet |
Connectivity | Fibre Channel over Ethernet (FCoE) |
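The storage rows in the three hardware tables can be compared with a short calculation. The sketch below computes raw capacity and an upper bound on data capacity for each RAID layout, assuming a single RAID group per server and no hot spares. The function names are hypothetical, and real usable capacity is lower once spares, filesystem reserves, and vendor overhead are subtracted.

```python
# Drives consumed by parity in one RAID group.
PARITY_DRIVES = {"raid5": 1, "raid-dp": 2}


def raw_capacity_tb(drives: int, drive_tb: float) -> float:
    """Total capacity of all drives, before any redundancy."""
    return drives * drive_tb


def data_capacity_tb(level: str, drives: int, drive_tb: float) -> float:
    """Upper bound on capacity left for data in a single RAID group."""
    level = level.lower()
    if level == "raid1":
        # Every drive holds the same mirrored copy.
        return drive_tb
    if level not in PARITY_DRIVES:
        raise ValueError(f"unsupported RAID level: {level}")
    return (drives - PARITY_DRIVES[level]) * drive_tb


if __name__ == "__main__":
    layouts = [
        ("GPU server", "raid1", 2, 4),
        ("CPU server", "raid5", 4, 8),
        ("Storage server", "raid-dp", 16, 18),
    ]
    for name, level, drives, size in layouts:
        raw = raw_capacity_tb(drives, size)
        data = data_capacity_tb(level, drives, size)
        print(f"{name}: raw {raw} TB, at most {data} TB for data")
```

For the storage server, 16 x 18 TB is 288 TB raw. Two drives' worth goes to RAID-DP parity, which leaves at most 252 TB for data under these assumptions.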
Software Stack
The software stack is carefully chosen to maximize performance and scalability.
- Operating System: Ubuntu 22.04 LTS – A stable, well-supported platform. See Ubuntu Server.
- CUDA Toolkit: 11.8 – Required for GPU acceleration. See CUDA Installation Guide for setup instructions.
- cuDNN: 8.6.0 – NVIDIA's GPU-accelerated library of deep neural network primitives.
- TensorFlow: 2.12.0 – A popular deep learning framework. See TensorFlow Documentation.
- PyTorch: 2.0.1 – Another widely used deep learning framework. See PyTorch Website.
- MPI: Open MPI 4.1.4 – For distributed training and inference. Refer to MPI Tutorial for more information.
- NCCL: 2.14 – NVIDIA Collective Communications Library for multi-GPU and multi-node communication.
- Kubernetes: 1.27 – Container orchestration for managing deployments. Read the Kubernetes Basics article.
- Docker: 20.10.21 – Containerization platform for packaging and deploying applications. See Docker Installation.
- Monitoring: Prometheus and Grafana – For system monitoring and performance analysis. See Prometheus Setup and Grafana Configuration.
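Because the stack pins specific versions, it can help to compare what a node reports against the list above. The sketch below is one way to do that. The `PINNED` keys and both function names are hypothetical, and the caller is assumed to collect the installed versions by some other means (for example from `torch.__version__` or a package manager query), which this sketch does not attempt.

```python
from typing import Dict, Optional, Tuple

# Versions copied from the software stack list; the keys are illustrative.
PINNED: Dict[str, str] = {
    "cuda": "11.8",
    "cudnn": "8.6.0",
    "tensorflow": "2.12.0",
    "torch": "2.0.1",
    "openmpi": "4.1.4",
    "nccl": "2.14",
    "kubernetes": "1.27",
    "docker": "20.10.21",
}


def matches_pin(installed: str, pinned: str) -> bool:
    """True when `installed` agrees with every component of `pinned`."""
    # Drop local build tags such as "+cu118" before comparing.
    inst = installed.split("+")[0].split(".")
    pin = pinned.split(".")
    # Component-wise comparison: "11.8.0" satisfies "11.8", "11.80" does not.
    return inst[:len(pin)] == pin


def find_mismatches(
    installed: Dict[str, str],
    pinned: Dict[str, str] = PINNED,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each failing component to (pinned, installed or None if missing)."""
    problems = {}
    for name, want in pinned.items():
        have = installed.get(name)
        if have is None or not matches_pin(have, want):
            problems[name] = (want, have)
    return problems


if __name__ == "__main__":
    # Hypothetical report from one node.
    node = {"cuda": "11.8.0", "torch": "2.0.1+cu118", "nccl": "2.18.3"}
    for name, (want, have) in find_mismatches(node).items():
        print(f"{name}: pinned {want}, found {have}")
```

A prefix match on version components is a deliberate choice here: it accepts patch releases of a pinned minor version while still rejecting a different minor version.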
Networking Configuration
Network performance is critical: communication between servers during training and inference needs low latency and high bandwidth.
- Interconnect: 200Gbps InfiniBand between GPU servers and 100Gbps Ethernet for CPU and storage servers.
- Network Topology: Fat-tree topology for optimal bandwidth and low latency.
- Network Segmentation: VLANs are used to isolate different types of traffic (e.g., management, storage, data transfer).
- Firewall: A stateful firewall is configured to protect the cluster from unauthorized access. Consult the Firewall Configuration guide.
- DNS: Internal DNS server for resolving hostnames within the cluster.
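To give a sense of how a fat-tree scales, the sketch below applies the standard sizing formulas for a three-tier k-ary fat-tree built from identical k-port switches: k pods, (k/2)^2 core switches, and k^3/4 hosts. The source does not state the switch port count, the number of tiers, or which fabric (InfiniBand or Ethernet) uses this topology, so the values of k and the function names are hypothetical.

```python
def fat_tree_size(k: int) -> dict:
    """Component counts for a three-tier k-ary fat-tree of k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # k/2 per pod
        "aggregation_switches": k * half,  # k/2 per pod
        "core_switches": half * half,      # (k/2)^2
        "hosts": k * half * half,          # k^3 / 4
    }


def min_ports_for_hosts(hosts: int) -> int:
    """Smallest even port count k whose fat-tree holds at least `hosts`."""
    k = 2
    while fat_tree_size(k)["hosts"] < hosts:
        k += 2
    return k


if __name__ == "__main__":
    for k in (4, 8, 16):
        print(k, fat_tree_size(k))
    print(min_ports_for_hosts(100))
```

The host count grows with the cube of the port count, which is why a modest increase in switch radix leaves substantial room for adding servers.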
Security Considerations
Security is paramount. The following security measures are in place:
- Access Control: Role-Based Access Control (RBAC) is implemented to restrict access to sensitive resources.
- Encryption: All data in transit is encrypted using TLS.
- Regular Security Audits: Periodic security audits are conducted to identify and address vulnerabilities.
- Intrusion Detection/Prevention System (IDS/IPS): An IDS/IPS is deployed to detect and prevent malicious activity.
- Patch Management: A rigorous patch management process is followed to ensure that all software is up-to-date. See Security Best Practices.
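For the encryption-in-transit measure, a client inside the cluster might build its TLS settings along the following lines. This is a minimal sketch using Python's standard `ssl` module. The function name `make_client_context`, the optional internal CA file, and the TLS 1.2 minimum are assumptions, since the source does not state the required protocol version or how certificates are issued.

```python
import ssl
from typing import Optional


def make_client_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the server certificate.

    `ca_file` is a hypothetical path to an internal CA bundle; when it is
    omitted, the system's default trust store is used.
    """
    # create_default_context enables certificate and hostname verification.
    ctx = ssl.create_default_context(cafile=ca_file)
    # Assumed policy: refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    ctx = make_client_context()
    print(ctx.minimum_version, ctx.verify_mode, ctx.check_hostname)
```

The returned context can be passed to standard library clients that accept one, such as `http.client.HTTPSConnection(host, context=ctx)`.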
Future Expansion
We plan to expand the cluster with additional GPU servers and storage capacity as the needs of the "AI in Jersey" project grow. We are also exploring the use of newer GPU architectures and faster network interconnects. Please see Future Infrastructure Plans for details.