AI Best Practices

From Server rental store
Revision as of 03:54, 16 April 2025 by Admin (talk | contribs) (Automated server configuration article)

AI Best Practices: Server Configuration

This article outlines best practices for server configuration when deploying and running Artificial Intelligence (AI) workloads on our infrastructure. The guidelines are designed to maximize performance, stability, and scalability, and to give newcomers the key considerations. See Special:MyPage for contact information if you have questions.

1. Hardware Considerations

AI workloads, particularly those involving Machine Learning (ML), are computationally intensive, so choosing the right hardware is paramount. A solid hardware foundation is crucial for a successful deployment; see Help:Contents for general help.

The following table outlines recommended specifications. The first column lists *minimums*; performance improves with additional resources. Always consult the system administrators (Help:System administrators) before making hardware changes.

Component | Minimum Specification              | Recommended Specification
CPU       | Intel Xeon Silver 4210 or AMD EPYC 7282 | Intel Xeon Gold 6248R or AMD EPYC 7763
RAM       | 64 GB DDR4-2666                    | 256 GB DDR4-3200
Storage   | 1 TB NVMe SSD                      | 4 TB NVMe SSD (RAID 1 recommended)
GPU       | NVIDIA Tesla T4 (16 GB)            | NVIDIA A100 (80 GB) or equivalent AMD Instinct MI250X
Network   | 10 Gbps Ethernet                   | 40 Gbps InfiniBand or 25 Gbps Ethernet

Consider the type of AI workload: Deep Learning (DL) benefits massively from GPU acceleration, while Natural Language Processing (NLP) workloads may be more CPU-bound but still benefit from fast storage and ample RAM. See Special:Search for previous discussions.
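
Before sizing a workload against the table above, it helps to inventory what a host actually has. The sketch below uses standard Linux tools; the GPU check is guarded because `nvidia-smi` only exists once the NVIDIA driver is installed:

```shell
# Inventory the hardware before sizing an AI workload (standard Linux tools).
lscpu | grep -E 'Model name|^CPU\(s\)'      # CPU model and logical core count
free -h | head -2                           # installed RAM
lsblk -d -o NAME,SIZE,ROTA                  # ROTA=0 means SSD/NVMe, 1 means spinning HDD
# Guarded GPU check: reports rather than fails on hosts without the NVIDIA driver
command -v nvidia-smi >/dev/null && nvidia-smi -L || echo "no NVIDIA GPU detected"
```

On a GPU host, `nvidia-smi -L` lists each installed GPU by index and model name.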

2. Operating System & Software Stack

We currently standardize on Ubuntu Server 22.04 LTS for AI workloads; it provides a stable base with excellent package availability. Other distributions may be considered with prior approval (see Help:Policy).

The following software is essential:

  • CUDA Toolkit & cuDNN: (NVIDIA GPUs) For GPU-accelerated computation. Ensure compatibility with your GPU and chosen ML framework. Details can be found on the NVIDIA developer website.
  • NCCL: (NVIDIA GPUs, multi-GPU setups) For high-bandwidth communication between GPUs.
  • Python: (version 3.9 or higher) The primary language for most ML frameworks.
  • TensorFlow, PyTorch, or JAX: Choose the framework best suited to your AI task.
  • Docker: Containerization for reproducibility and portability. See Help:Docker for information.
  • Kubernetes: (optional, for large-scale deployments) Orchestration of Docker containers.
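
A quick way to confirm the essentials above are in place is a set of guarded version checks; this is a minimal sketch in which each check reports rather than aborts when a component is absent:

```shell
# Verify the core software stack; each check reports rather than aborts.
python3 -c 'import sys; assert sys.version_info >= (3, 9), "need Python 3.9+"; print(sys.version.split()[0])'
command -v docker >/dev/null && docker --version || echo "Docker not installed"
command -v nvcc >/dev/null && nvcc --version | tail -1 || echo "CUDA toolkit not found"
python3 -c 'import torch; print("PyTorch", torch.__version__)' 2>/dev/null || echo "PyTorch not installed"
```

Run this after provisioning a new node and again after any framework upgrade.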

3. Storage Configuration

Fast and reliable storage is critical. NVMe SSDs are *strongly* recommended over traditional HDDs. Consider RAID configurations for redundancy and improved performance.

The following table details storage considerations for different workload sizes:

Workload Size                                        | Storage Type | Recommended RAID Level | Minimum Capacity
Small (e.g., experimentation, small datasets)        | NVMe SSD     | RAID 1                 | 1 TB
Medium (e.g., training medium-sized models)          | NVMe SSD     | RAID 10                | 4 TB
Large (e.g., training large models, production inference) | NVMe SSD | RAID 10                | 8 TB+

Filesystems should be configured for optimal performance. XFS is generally preferred for large files and high throughput. Regular backups are essential; see Help:Backups.
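
As a sketch of the XFS setup described above: format a dedicated NVMe device and mount it with `noatime`. The device name `/dev/nvme1n1` and mount point `/data/ai` are placeholders for illustration, and `mkfs.xfs` destroys existing data, so verify the device with `lsblk` first:

```shell
# DESTRUCTIVE: creates a new filesystem on the named device. Verify with lsblk first!
mkfs.xfs -f /dev/nvme1n1                  # XFS: good throughput for large files
mkdir -p /data/ai
mount -o noatime /dev/nvme1n1 /data/ai    # noatime avoids metadata writes on every read
# Persist the mount across reboots
echo '/dev/nvme1n1 /data/ai xfs noatime 0 2' >> /etc/fstab
```

On a RAID array, point `mkfs.xfs` at the array device (e.g., the `md` or hardware-RAID volume) rather than an individual disk.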

4. Networking Considerations

High-bandwidth, low-latency networking is crucial for distributed training and inference. 10 Gbps Ethernet is a baseline; 40 Gbps InfiniBand offers superior performance for multi-node setups.

The following table summarizes networking best practices:

Aspect              | Recommendation
Network Interface   | Dedicated network interface for AI workloads.
Network Segmentation | Isolate AI traffic from other network traffic.
Jumbo Frames        | Enable jumbo frames (MTU 9000) to reduce per-packet overhead.
RDMA (Remote Direct Memory Access) | Use RDMA over InfiniBand for low-latency communication.
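
Enabling jumbo frames is a two-step change: raise the MTU on the interface, then confirm a full-size packet actually traverses the path unfragmented. The interface name `eth1` and the peer host are placeholders:

```shell
# Raise the MTU on the dedicated AI interface (hypothetical name eth1)
ip link set dev eth1 mtu 9000
ip link show eth1 | grep -o 'mtu [0-9]*'   # confirm the new MTU took effect
# End-to-end check: 8972 bytes = 9000 minus 20 (IP) and 8 (ICMP) header bytes;
# -M do forbids fragmentation, so the ping fails if any hop has a smaller MTU
ping -M do -s 8972 -c 3 peer-host.example
```

Every switch and host on the path must be configured for the larger MTU; a single 1500-byte hop makes the ping above fail.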

5. Monitoring and Logging

Comprehensive monitoring and logging are essential for identifying and resolving performance bottlenecks and errors. Utilize tools such as:

  • Prometheus & Grafana: For time-series data collection and visualization.
  • ELK Stack (Elasticsearch, Logstash, Kibana): For centralized logging and analysis.
  • System Monitoring Tools: `top`, `htop`, `iostat`, `vmstat` for real-time system performance monitoring.
  • GPU Monitoring Tools: `nvidia-smi` for GPU utilization and health.

Regularly review logs and metrics to proactively identify and address potential issues. See Help:Monitoring for details on existing monitoring systems.
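
For lightweight GPU checks, `nvidia-smi`'s CSV query mode pipes cleanly into `awk`. The sketch below runs the formatting step against a hard-coded sample line (illustrative values, not a measurement) so it works even on a host without a GPU; the commented query shows the live equivalent:

```shell
# Sample line in the format produced by the nvidia-smi query below (values illustrative)
sample='0, NVIDIA A100-SXM4-80GB, 87, 40536, 81920'
echo "$sample" | awk -F', ' '{ printf "GPU %s (%s): %s%% util, %s/%s MiB\n", $1, $2, $3, $4, $5 }'
# Live equivalent on a host with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
```

The same one-liner can feed a cron job or a Prometheus textfile exporter for continuous collection.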

6. Security Considerations

AI systems can be vulnerable to attacks. Implement robust security measures:

  • Firewall: Restrict access to necessary ports only.
  • Access Control: Implement strict access control policies.
  • Data Encryption: Encrypt sensitive data at rest and in transit.
  • Regular Security Audits: Conduct regular security audits to identify and address vulnerabilities. Refer to Help:Security.
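
As a sketch of the "necessary ports only" rule using `ufw` (Ubuntu's default firewall front end): deny everything inbound, then open just SSH and the service port. Port 443 stands in for a hypothetical HTTPS inference endpoint; adjust to the services your host actually runs:

```shell
# Default-deny inbound, then open only what the host actually serves
ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp      # SSH for administration (consider restricting source IPs)
ufw allow 443/tcp     # hypothetical HTTPS inference endpoint
ufw --force enable    # --force skips the interactive confirmation prompt
ufw status verbose    # review the resulting rule set
```

Run the `allow` rules before `enable` when working over SSH, or the default-deny policy will cut off your session.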


7. Further Resources

Intel-Based Server Configurations

Configuration                 | Specifications                              | Benchmark
Core i7-6700K/7700 Server     | 64 GB DDR4, 2x512 GB NVMe SSD               | CPU Benchmark: 8046
Core i7-8700 Server           | 64 GB DDR4, 2x1 TB NVMe SSD                 | CPU Benchmark: 13124
Core i9-9900K Server          | 128 GB DDR4, 2x1 TB NVMe SSD                | CPU Benchmark: 49969
Core i9-13900 Server (64GB)   | 64 GB RAM, 2x2 TB NVMe SSD                  |
Core i9-13900 Server (128GB)  | 128 GB RAM, 2x2 TB NVMe SSD                 |
Core i5-13500 Server (64GB)   | 64 GB RAM, 2x500 GB NVMe SSD                |
Core i5-13500 Server (128GB)  | 128 GB RAM, 2x500 GB NVMe SSD               |
Core i5-13500 Workstation     | 64 GB DDR5 RAM, 2 NVMe SSD, NVIDIA RTX 4000 |

AMD-Based Server Configurations

Configuration                  | Specifications                 | Benchmark
Ryzen 5 3600 Server            | 64 GB RAM, 2x480 GB NVMe       | CPU Benchmark: 17849
Ryzen 7 7700 Server            | 64 GB DDR5 RAM, 2x1 TB NVMe    | CPU Benchmark: 35224
Ryzen 9 5950X Server           | 128 GB RAM, 2x4 TB NVMe        | CPU Benchmark: 46045
Ryzen 9 7950X Server           | 128 GB DDR5 ECC, 2x2 TB NVMe   | CPU Benchmark: 63561
EPYC 7502P Server (128GB/1TB)  | 128 GB RAM, 1 TB NVMe          | CPU Benchmark: 48021
EPYC 7502P Server (128GB/2TB)  | 128 GB RAM, 2 TB NVMe          | CPU Benchmark: 48021
EPYC 7502P Server (128GB/4TB)  | 128 GB RAM, 2x2 TB NVMe        | CPU Benchmark: 48021
EPYC 7502P Server (256GB/1TB)  | 256 GB RAM, 1 TB NVMe          | CPU Benchmark: 48021
EPYC 7502P Server (256GB/4TB)  | 256 GB RAM, 2x2 TB NVMe        | CPU Benchmark: 48021
EPYC 9454P Server              | 256 GB RAM, 2x2 TB NVMe        |

Order Your Dedicated Server

Configure and order your ideal server configuration

Need Assistance?

See Special:MyPage for contact information.

⚠️ *Note: All benchmark scores are approximate and may vary based on configuration. Server availability subject to stock.* ⚠️