AI in Monaco

From Server rental store
Revision as of 07:06, 16 April 2025 by Admin (talk | contribs) (Automated server configuration article)
AI in Monaco: Server Configuration

This document details the server configuration supporting the "AI in Monaco" project. This guide is intended for new engineers onboarding to the infrastructure team and provides a comprehensive overview of the hardware and software components. This project utilizes a distributed system designed for high throughput and low latency inference of large language models. See also System Architecture Overview for a broader context.

Overview

The "AI in Monaco" project leverages a cluster of dedicated servers to run a suite of AI models. These models are used for real-time data analysis and prediction and require significant computational resources. The server environment is based on a Linux distribution (Ubuntu 22.04 LTS) and uses a containerized deployment strategy with Docker and Kubernetes. Efficient resource management is crucial, as detailed in the Resource Allocation Policy. The primary goal of this configuration is to provide a scalable and reliable platform for AI model deployment and execution. Understanding the networking setup is vital; refer to Network Topology.

Hardware Specifications

The server cluster consists of the following hardware components. Each node adheres to the specification below.

Component | Specification
CPU | Dual Intel Xeon Gold 6338 (32 Cores / 64 Threads per CPU)
RAM | 512 GB DDR4 ECC Registered 3200 MHz
Storage (OS) | 1 TB NVMe SSD (PCIe Gen4)
Storage (Models) | 8 x 8 TB SAS HDD (RAID 6)
Network Interface | Dual 100GbE QSFP28
GPU | 4 x NVIDIA A100 80 GB

These specifications were chosen to balance compute power, memory capacity, and storage throughput, as discussed in the Hardware Selection Rationale. Regular hardware monitoring is performed using Nagios.
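As a back-of-the-envelope sanity check on the figures above (a sketch for orientation, not part of the deployment tooling): RAID 6 dedicates two disks' worth of capacity to parity, and each node carries four 80 GB GPUs.

```shell
# RAID 6 stores two disks' worth of parity, so usable capacity is (N - 2) disks.
disks=8
disk_tb=8
usable_tb=$(( (disks - 2) * disk_tb ))
echo "Model store usable capacity: ${usable_tb} TB"   # 48 TB of 64 TB raw

# Aggregate GPU memory per node: 4 x A100 80 GB.
gpus=4
gpu_gb=80
gpu_total_gb=$(( gpus * gpu_gb ))
echo "GPU memory per node: ${gpu_total_gb} GB"        # 320 GB
```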

Software Stack

The software stack is built around a containerized environment, allowing for portability and scalability.

Software | Version | Purpose
Operating System | Ubuntu 22.04 LTS | Base OS for all servers
Docker | 24.0.7 | Containerization platform
Kubernetes | 1.27.4 | Container orchestration
NVIDIA Driver | 535.104.05 | GPU driver for CUDA and TensorRT
CUDA Toolkit | 12.2 | NVIDIA's parallel computing platform
TensorRT | 8.6.1 | NVIDIA's inference optimizer and runtime
Prometheus | 2.46.0 | Monitoring and alerting

The specific versions were selected for compatibility and performance, as documented in the Software Version Control. Regular software updates are managed via Ansible.
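When auditing a node against the pinned versions above, a small version-comparison helper is handy. The `ver_ge` function below is a hypothetical sketch built on GNU coreutils' `sort -V` (version sort); it is not part of the project's Ansible tooling.

```shell
# ver_ge A B: succeeds if version A >= version B, using sort -V (version sort).
ver_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example: check a reported Docker version against the 24.0.7 pin from the table.
installed="24.0.7"
if ver_ge "$installed" "24.0.7"; then
  echo "Docker ${installed}: OK"
else
  echo "Docker ${installed}: older than pinned version"
fi
```

In practice the installed version string would come from `docker --version` or the package manager rather than a literal.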

Networking Configuration

The network infrastructure is designed for high bandwidth and low latency communication between servers.

Parameter | Value
Network Topology | Clos network
Inter-Node Bandwidth | 100GbE
Load Balancer | HAProxy
DNS | Bind9
Firewall | iptables
Internal Network | 10.0.0.0/16

Detailed network diagrams are available at Network Diagrams. Security considerations regarding network access are outlined in the Security Policy. Troubleshooting network issues can be done with tcpdump.
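For sizing context, a /16 internal network provides 2^(32−16) − 2 usable host addresses, far more than the current node count. A quick sketch of the arithmetic follows; the commented tcpdump line is illustrative only, and the interface name is an assumption, not a project value.

```shell
# Usable host addresses in 10.0.0.0/16: 2^(32 - prefix) minus network and broadcast.
prefix=16
hosts=$(( (1 << (32 - prefix)) - 2 ))
echo "Usable hosts in 10.0.0.0/${prefix}: ${hosts}"   # 65534

# Illustrative capture of inter-node traffic (requires root; 'eth0' is a placeholder):
#   tcpdump -i eth0 -n net 10.0.0.0/16
```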


Security Considerations

Security is a paramount concern. All servers are behind a firewall and access is restricted to authorized personnel only. Regular security audits are conducted, as detailed in the Security Audit Schedule. SSL/TLS is used for all communication between servers and clients. User access is managed through LDAP. Intrusion detection systems are in place, and logs are monitored daily. Refer to Incident Response Plan for emergency procedures.
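One routine check worth automating alongside the audits above is TLS certificate expiry on the cluster's endpoints. The sketch below generates a throwaway self-signed certificate purely for illustration and reads its expiry date; against a live endpoint one would query with `openssl s_client` instead. All paths and hostnames here are placeholders, not project values.

```shell
# Create a short-lived self-signed certificate (illustration only).
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/demo-key.pem -out /tmp/demo-cert.pem \
  -days 30 -subj "/CN=demo.internal" 2>/dev/null

# Print the expiry date ("notAfter=...") of the certificate.
openssl x509 -in /tmp/demo-cert.pem -noout -enddate

# Against a live endpoint (placeholder host):
#   echo | openssl s_client -connect lb.internal:443 2>/dev/null | openssl x509 -noout -enddate
```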


Future Scalability

The architecture is designed for future scalability. Adding new nodes to the Kubernetes cluster is a straightforward process. The storage infrastructure can be expanded by adding more SAS HDDs or migrating to a distributed file system like Ceph. The network infrastructure can be upgraded to 200GbE or 400GbE as needed. Considerations for scaling are detailed in the Scalability Plan.



