AI in Cleveland: Server Configuration Documentation
This document details the server configuration supporting the "AI in Cleveland" project and provides a technical overview for new administrators and developers. The project delivers local AI services and data-analysis capabilities to the Cleveland metropolitan area. Understanding this setup is essential for maintenance, scaling, and troubleshooting.
Overview
The "AI in Cleveland" infrastructure is built around a hybrid cloud model. Core processing and model training occur on dedicated on-premise hardware, while data storage and less intensive tasks are handled by cloud services. This allows for cost-effective scaling and data security. The primary goal is to provide accessible AI tools for local businesses and researchers. This project leverages several key software components including TensorFlow, PyTorch, and Kubernetes for orchestration.
Hardware Specifications
We utilize three primary server types: Compute Nodes, Storage Nodes, and the Master Node. Each plays a distinct role in the overall architecture.
Compute Nodes
These servers handle the bulk of the AI model training and inference. They are equipped with high-end GPUs and fast processors.
Specification | Value |
---|---|
CPU | Dual Intel Xeon Gold 6338 (32 cores/64 threads per CPU) |
RAM | 512GB DDR4 ECC Registered @ 3200MHz |
GPU | 4x NVIDIA A100 (80GB VRAM each) |
Storage (Local) | 2x 4TB NVMe SSD (RAID 0) |
Network Interface | Dual 100GbE QSFP28 |
Operating System | Ubuntu 22.04 LTS |
We currently operate six Compute Nodes. Each node is monitored using Nagios for performance and uptime.
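Since each Compute Node carries four A100s, GPU temperature is a natural metric for Nagios to alert on. Below is a minimal sketch of an NRPE-style check built on nvidia-smi; the script name and the 85 °C critical threshold are assumptions for illustration, not the project's actual Nagios configuration.

```bash
#!/usr/bin/env bash
# check_gpu_temp.sh - illustrative NRPE-style check a Nagios server could run
# against a Compute Node; the script name and 85 °C threshold are assumptions.
CRIT=85
# nvidia-smi prints one temperature per GPU; keep the hottest of the four A100s.
max=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | sort -n | tail -1)
if [ "${max}" -ge "${CRIT}" ]; then
    echo "CRITICAL: hottest GPU at ${max}C"
    exit 2
fi
echo "OK: hottest GPU at ${max}C"
exit 0
```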
Storage Nodes
These servers provide large-capacity storage for datasets and model checkpoints. They are optimized for high throughput and reliability.
Specification | Value |
---|---|
CPU | Intel Xeon Silver 4310 (12 cores/24 threads) |
RAM | 128GB DDR4 ECC Registered @ 2666MHz |
Storage | 16x 16TB SAS HDD (RAID 6) - 192TB Usable |
Network Interface | Dual 25GbE SFP28 |
Operating System | CentOS 8 |
We have four Storage Nodes, configured for redundancy and scalability. Data is backed up nightly to a separate offsite location using rsync.
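As a rough illustration of that nightly backup, the sketch below pushes a dataset directory to the offsite host with rsync over SSH; the paths, destination hostname, SSH key, and cron schedule are placeholders rather than the real job definition.

```bash
#!/usr/bin/env bash
# offsite-backup.sh - illustrative nightly backup from a Storage Node.
# Paths, destination host, and SSH key are placeholders, not the real job.
# Scheduled from cron, e.g. /etc/cron.d/offsite-backup:
#   30 2 * * * root /usr/local/sbin/offsite-backup.sh
set -euo pipefail
rsync -aH --delete --partial \
      -e 'ssh -i /root/.ssh/offsite_backup_key' \
      /srv/datasets/ backup@offsite.example.org:/backups/"$(hostname -s)"/datasets/
```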
Master Node
The Master Node manages the Kubernetes cluster and provides a central point for monitoring and control.
Specification | Value |
---|---|
CPU | Intel Xeon Gold 6342 (28 cores/56 threads) |
RAM | 256GB DDR4 ECC Registered @ 3200MHz |
Storage | 2x 1TB NVMe SSD (RAID 1) |
Network Interface | Quad 10GbE SFP+ |
Operating System | Ubuntu 22.04 LTS |
The Master Node also hosts the Grafana dashboard for visualizing system metrics and the Prometheus time-series database.
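For context on how node metrics reach Prometheus, the sketch below appends a scrape job for node_exporter on the six Compute Nodes and reloads the service; the hostnames, the node_exporter port, and the assumption that scrape_configs ends the config file are illustrative, not copied from the live configuration.

```bash
# Illustrative scrape job for node_exporter on the six Compute Nodes.
# (Hostnames and the assumption that scrape_configs is the last section of
# /etc/prometheus/prometheus.yml are placeholders for the real config.)
sudo tee -a /etc/prometheus/prometheus.yml >/dev/null <<'EOF'
  - job_name: 'compute-nodes'
    static_configs:
      - targets: ['compute-01:9100', 'compute-02:9100', 'compute-03:9100',
                  'compute-04:9100', 'compute-05:9100', 'compute-06:9100']
EOF
sudo systemctl reload prometheus   # or restart, depending on the unit file
# All six targets should report up == 1:
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="compute-nodes"}'
```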
Software Configuration
The software stack is designed for scalability and ease of management.
- Kubernetes: We use Kubernetes for container orchestration, managing the deployment and scaling of AI applications; the cluster currently runs version 1.27 (a minimal GPU job sketch follows this list). See the Kubernetes Documentation for more details.
- Docker: All applications are containerized using Docker, ensuring consistent environments across all nodes.
- TensorFlow/PyTorch: The primary frameworks for model development and training. Versions are managed through pip and container images; see the TensorFlow and PyTorch websites.
- Ceph: A distributed storage solution that presents the Storage Nodes through a unified storage interface, giving the Compute Nodes seamless access to shared data. See the Ceph documentation.
- PostgreSQL: A PostgreSQL database stores metadata about datasets, models, and users. It is crucial for tracking data provenance and access control.
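To show how a training workload reaches the A100s through this stack, the sketch below submits a one-off Kubernetes Job that requests a single GPU. The job name, image, and command are placeholders, and it assumes the NVIDIA device plugin is installed so that nvidia.com/gpu is a schedulable resource; it is not taken from the project's actual manifests.

```bash
# Submit a one-off training job that asks the scheduler for one GPU.
# (Job name, image, and command are placeholders; assumes the NVIDIA device
# plugin is deployed so nvidia.com/gpu is a schedulable resource.)
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: example-training-job
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.org/ai-cleveland/trainer:latest
        command: ["python", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: 1    # one A100; increase for multi-GPU training
EOF
# Follow progress:
kubectl get jobs
kubectl logs job/example-training-job -f
```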
Networking
The network infrastructure is a critical component of the "AI in Cleveland" project.
- A dedicated VLAN is used for communication between servers.
- Firewall rules are configured with iptables to restrict access to specific ports and services (an illustrative rule set is sketched after this list).
- An HAProxy load balancer distributes incoming traffic across the Compute Nodes.
- Network traffic is captured and inspected with Wireshark during security investigations and troubleshooting.
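The rules below sketch what a default-deny iptables baseline for a node on the server VLAN might look like: inbound traffic is dropped unless it arrives from the dedicated subnet on an expected port. The 10.10.0.0/24 subnet and the port selection are assumptions; the production rule set may differ.

```bash
# Illustrative default-deny inbound policy for a node on the server VLAN.
# (Subnet 10.10.0.0/24 and the exact port list are assumptions, not the
# project's actual firewall configuration.)
iptables -P INPUT DROP
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Management and cluster traffic only from the dedicated VLAN
iptables -A INPUT -s 10.10.0.0/24 -p tcp --dport 22    -j ACCEPT   # SSH
iptables -A INPUT -s 10.10.0.0/24 -p tcp --dport 10250 -j ACCEPT   # kubelet API
iptables -A INPUT -s 10.10.0.0/24 -p tcp --dport 9100  -j ACCEPT   # node_exporter scrapes
```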
Security Considerations
Security is paramount. The following measures are in place:
- Regular security audits are conducted.
- All servers are protected by a firewall.
- Access control is strictly enforced using LDAP (see the lookup example after this list).
- Data is encrypted at rest and in transit.
- Intrusion detection systems are in place.
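When troubleshooting LDAP-backed access control, two quick checks usually suffice: confirm the directory returns the account, then confirm the node resolves it through NSS. The directory host, base DN, and example user below are placeholders, not values from the live directory.

```bash
# Does the directory know about the user? (ldaps URI, base DN, and uid are placeholders.)
ldapsearch -x -H ldaps://ldap.example.org -b "dc=ai-cleveland,dc=org" "(uid=jdoe)" uid memberOf
# Does the node resolve the same account through sssd/nslcd?
getent passwd jdoe
```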
Future Expansion
We plan to expand the infrastructure with additional Compute Nodes and Storage Nodes as demand increases. We are also exploring the use of specialized hardware accelerators, such as TPUs, to further improve performance. Future integration with OpenStack is also being considered.
Related Pages
- Server Administration
- Data Storage
- Network Configuration
- Security Protocols
- Kubernetes
- Docker
- TensorFlow
- PyTorch
- Ceph
- PostgreSQL
- Nagios
- Grafana
- Prometheus
- iptables
- LDAP
- HAProxy
- OpenStack