AI in Cleveland

AI in Cleveland: Server Configuration Documentation

This document describes the server configuration that supports the "AI in Cleveland" project and provides a technical overview for new administrators and developers. The project delivers local AI services and data analysis capabilities to the Cleveland metropolitan area. Understanding this setup is essential for maintenance, scaling, and troubleshooting.

Overview

The "AI in Cleveland" infrastructure is built around a hybrid cloud model. Core processing and model training occur on dedicated on-premise hardware, while data storage and less intensive tasks are handled by cloud services. This allows for cost-effective scaling and data security. The primary goal is to provide accessible AI tools for local businesses and researchers. This project leverages several key software components including TensorFlow, PyTorch, and Kubernetes for orchestration.

Hardware Specifications

We utilize three primary server types: Compute Nodes, Storage Nodes, and the Master Node. Each plays a distinct role in the overall architecture.

Compute Nodes

These servers handle the bulk of the AI model training and inference. They are equipped with high-end GPUs and fast processors.

Specification | Value
CPU | Dual Intel Xeon Gold 6338 (32 cores/64 threads per CPU)
RAM | 512GB DDR4 ECC Registered @ 3200MHz
GPU | 4x NVIDIA A100 (80GB VRAM each)
Storage (Local) | 2x 4TB NVMe SSD (RAID 0)
Network Interface | Dual 100GbE QSFP28
Operating System | Ubuntu 22.04 LTS

We currently operate six Compute Nodes. Each node is monitored using Nagios for performance and uptime.
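
As a quick sanity check that a Compute Node actually sees all four A100s, a minimal sketch along the following lines can be run on the node itself. It assumes PyTorch is installed in the node's Python environment; it is illustrative only and does not reflect a specific production script.

```python
# gpu_check.py - quick sanity check of GPU visibility on a Compute Node.
# Assumes PyTorch is installed on the node; run locally on each node.
import torch

def report_gpus() -> None:
    """Print the GPUs PyTorch can see, so a node can be compared against spec (4x A100)."""
    if not torch.cuda.is_available():
        print("CUDA is not available on this node")
        return
    count = torch.cuda.device_count()
    print(f"Visible GPUs: {count}")
    for i in range(count):
        props = torch.cuda.get_device_properties(i)
        # total_memory is reported in bytes; an 80GB A100 should show roughly 80 GiB.
        print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")

if __name__ == "__main__":
    report_gpus()
```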

Storage Nodes

These servers provide large-capacity storage for datasets and model checkpoints. They are optimized for high throughput and reliability.

Specification | Value
CPU | Intel Xeon Silver 4310 (12 cores/24 threads)
RAM | 128GB DDR4 ECC Registered @ 2666MHz
Storage | 16x 16TB SAS HDD (RAID 6), 192TB usable
Network Interface | Dual 25GbE SFP28
Operating System | CentOS 8

We have four Storage Nodes, configured for redundancy and scalability. Data is backed up nightly to a separate offsite location using rsync.
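
The nightly rsync job could be wrapped in a small script along the lines of the sketch below. The source path and offsite destination shown here are placeholders, not the production values.

```python
# backup_sketch.py - illustrative wrapper around the nightly rsync backup job.
# Host and path names below are placeholders, not the production values.
import subprocess
import sys

SOURCE = "/data/datasets/"                                # hypothetical local dataset path
DEST = "backup@offsite.example.org:/backups/datasets/"    # hypothetical offsite target

def run_backup() -> int:
    """Mirror SOURCE to DEST, preserving attributes and removing deleted files."""
    cmd = [
        "rsync",
        "-a",            # archive mode: recurse, keep permissions, times, and links
        "--delete",      # keep the mirror exact by removing files deleted at the source
        "--partial",     # resume interrupted transfers of large checkpoints
        "--stats",
        SOURCE,
        DEST,
    ]
    result = subprocess.run(cmd)
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_backup())
```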

Master Node

The Master Node manages the Kubernetes cluster and provides a central point for monitoring and control.

Specification | Value
CPU | Intel Xeon Gold 6342 (28 cores/56 threads)
RAM | 256GB DDR4 ECC Registered @ 3200MHz
Storage | 2x 1TB NVMe SSD (RAID 1)
Network Interface | Quad 10GbE SFP+
Operating System | Ubuntu 22.04 LTS

The Master Node also hosts the Grafana dashboard for visualizing system metrics and the Prometheus time-series database.
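
Metrics stored in Prometheus can also be pulled programmatically through its HTTP API, for example for ad-hoc reports. The sketch below is illustrative only: the Master Node address and the node_load1 metric (exposed by node_exporter, if deployed) are assumptions, not confirmed parts of this installation.

```python
# prom_query.py - minimal example of pulling a metric from the Prometheus HTTP API.
# The URL and the metric name (node_load1, from node_exporter) are assumptions.
import requests

PROMETHEUS = "http://master-node:9090"  # placeholder address for the Master Node

def query(expr: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    payload = resp.json()
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]

if __name__ == "__main__":
    # 1-minute load average per node, as exported by node_exporter.
    for sample in query("node_load1"):
        instance = sample["metric"].get("instance", "unknown")
        timestamp, value = sample["value"]
        print(f"{instance}: load1={value}")
```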

Software Configuration

The software stack is designed for scalability and ease of management.

  • Kubernetes: We use Kubernetes (currently version 1.27) for container orchestration, managing the deployment and scaling of AI applications. See the Kubernetes Documentation for details; a minimal Python-client sketch appears after this list.
  • Docker: All applications are containerized using Docker, ensuring consistent environments across all nodes.
  • TensorFlow/PyTorch: The primary frameworks for model development and training. Versions are managed through pip and container images (see the TensorFlow and PyTorch websites).
  • Ceph: A distributed storage solution that presents the Storage Nodes as a unified storage interface, giving the Compute Nodes seamless data access (see the Ceph documentation).
  • PostgreSQL: A PostgreSQL database stores metadata about datasets, models, and users. It is crucial for tracking data provenance and access control.
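
As referenced in the Kubernetes item above, routine tasks such as scaling an inference Deployment can be scripted with the official Kubernetes Python client. This is a minimal sketch; the namespace and Deployment name are hypothetical.

```python
# scale_inference.py - sketch of scaling a Deployment with the Kubernetes Python client.
# The namespace and deployment name are illustrative; adjust to the actual workloads.
from kubernetes import client, config

def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    """Patch the replica count of a Deployment via the apps/v1 API."""
    config.load_kube_config()   # reads ~/.kube/config; use load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    body = {"spec": {"replicas": replicas}}
    apps.patch_namespaced_deployment_scale(name=name, namespace=namespace, body=body)
    print(f"Scaled {namespace}/{name} to {replicas} replicas")

if __name__ == "__main__":
    scale_deployment("inference-api", "ai-cleveland", replicas=4)  # hypothetical names
```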

Networking

The network infrastructure is a critical component of the "AI in Cleveland" project.

  • A dedicated VLAN is used for communication between servers.
  • Firewall rules are configured using iptables to restrict access to specific ports and services.
  • An HAProxy load balancer distributes traffic across the Compute Nodes (a minimal reachability check is sketched after this list).
  • Network traffic is captured and analyzed with Wireshark during security investigations and troubleshooting.
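
The reachability check mentioned above could look like the following sketch, which simply probes TCP ports on the load balancer and a Compute Node. Hostnames and ports are placeholders for the addresses used on the internal VLAN.

```python
# reachability_check.py - simple TCP reachability probe for load balancer and Compute Nodes.
# Hostnames and ports are placeholders; substitute the addresses used on the internal VLAN.
import socket

TARGETS = [
    ("lb.internal", 443),         # hypothetical HAProxy front end
    ("compute-01.internal", 22),  # hypothetical Compute Node SSH port
]

def is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host, port in TARGETS:
        state = "open" if is_open(host, port) else "unreachable"
        print(f"{host}:{port} {state}")
```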

Security Considerations

Security is paramount. The following measures are in place:

  • Regular security audits are conducted.
  • All servers are protected by a firewall.
  • Access control is strictly enforced through LDAP (a minimal bind check is sketched after this list).
  • Data is encrypted at rest and in transit.
  • Intrusion detection systems are in place.
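
The LDAP bind check referenced above might be sketched as follows, assuming the ldap3 Python library; the directory URL and service-account DN are placeholders, not the production values.

```python
# ldap_bind_check.py - illustrative LDAP bind test using the ldap3 library (an assumption;
# the directory address and DN layout below are placeholders).
from ldap3 import Server, Connection, ALL

LDAP_URL = "ldaps://ldap.internal:636"                      # placeholder directory address
BIND_DN = "uid=svc-monitor,ou=services,dc=example,dc=org"   # hypothetical service account

def can_bind(password: str) -> bool:
    """Return True if a simple bind with the service account succeeds."""
    server = Server(LDAP_URL, get_info=ALL)
    try:
        conn = Connection(server, user=BIND_DN, password=password, auto_bind=True)
        conn.unbind()
        return True
    except Exception:
        return False

if __name__ == "__main__":
    import getpass
    print("bind ok" if can_bind(getpass.getpass("Service account password: ")) else "bind failed")
```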

Future Expansion

We plan to expand the infrastructure with additional Compute Nodes and Storage Nodes as demand increases. We are also exploring specialized hardware accelerators, such as TPUs, to further improve performance, and future integration with OpenStack is under consideration.


