AI Best Practices: Server Configuration
This article outlines best practices for configuring servers that deploy and run Artificial Intelligence (AI) workloads on our MediaWiki infrastructure. The guidelines aim to maximize performance, stability, and scalability, and are written with newcomers in mind.
1. Hardware Considerations
AI workloads, particularly those involving Machine Learning (ML), are computationally intensive, so choosing the right hardware is paramount. The following table lists recommended specifications. Treat the minimum column as a floor: performance generally improves with additional resources. Consult your system administrators before making hardware changes.
| Component | Minimum Specification | Recommended Specification |
|---|---|---|
| CPU | Intel Xeon Silver 4210 or AMD EPYC 7282 | Intel Xeon Gold 6248R or AMD EPYC 7763 |
| RAM | 64 GB DDR4-2666 | 256 GB DDR4-3200 |
| Storage | 1 TB NVMe SSD | 4 TB NVMe SSD (RAID 1 recommended) |
| GPU | NVIDIA Tesla T4 (16 GB) | NVIDIA A100 (80 GB) or AMD Instinct MI250X |
| Network | 10 Gbps Ethernet | 40 Gbps InfiniBand or 25 Gbps Ethernet |
Consider the type of AI workload: Deep Learning (DL) benefits massively from GPU acceleration, while Natural Language Processing (NLP) pipelines can be more CPU-bound but still benefit from fast storage and ample RAM.
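When sizing GPU memory, a quick back-of-the-envelope estimate helps decide whether a workload fits the minimum or recommended tier above. The sketch below uses common rules of thumb for mixed-precision training with the Adam optimizer (fp16 weights and gradients, fp32 master weights plus two optimizer moments); the byte counts, function name, and overhead factor are illustrative assumptions, not exact figures, and activations are excluded.

```python
# Rough rule of thumb for mixed-precision training with Adam:
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights
# and two Adam moments (12 B) per parameter. Illustrative only.
BYTES_PER_PARAM = 2 + 2 + 12

def estimate_training_memory_gb(num_params: float, overhead: float = 1.2) -> float:
    """Estimate GPU memory (GiB) for training, excluding activations.

    `overhead` is a fudge factor for framework buffers and fragmentation.
    """
    return num_params * BYTES_PER_PARAM * overhead / 2**30

# A 7-billion-parameter model needs well over 100 GiB for weights,
# gradients, and optimizer state alone -- beyond a single 80 GB A100,
# which is why multi-GPU setups (and NCCL, below) matter.
print(f"{estimate_training_memory_gb(7e9):.0f} GiB")
```

Inference-only workloads need far less, since gradients and optimizer state drop out of the estimate.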
2. Operating System & Software Stack
We currently standardize on Ubuntu Server 22.04 LTS for AI workloads; it provides a stable base with excellent package availability. Other distributions may be considered with prior approval.
The following software is essential:
- CUDA Toolkit & cuDNN: (NVIDIA GPUs) For GPU-accelerated computation. Ensure compatibility with your GPU and chosen ML framework. Details can be found on the NVIDIA developer website.
- NCCL: (NVIDIA GPUs, multi-GPU setups) For high-bandwidth communication between GPUs.
- Python: (version 3.9 or higher) The primary language for most ML frameworks.
- TensorFlow, PyTorch, or JAX: Choose the framework best suited to your AI task.
- Docker: Containerization for reproducibility and portability. See Help:Docker for information.
- Kubernetes: (optional, for large-scale deployments) Orchestration of Docker containers.
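A quick sanity check of a freshly provisioned machine can catch version mismatches before a long training run fails. This is a minimal sketch: the function name and the exact set of framework module names checked are illustrative assumptions, and real deployments should also verify CUDA/cuDNN versions against the framework's compatibility matrix.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # matches the minimum version listed above
FRAMEWORKS = ("tensorflow", "torch", "jax")  # at least one should be installed

def check_environment() -> list[str]:
    """Return human-readable problems; an empty list means all checks pass."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} is "
            f"older than {MIN_PYTHON[0]}.{MIN_PYTHON[1]}"
        )
    # find_spec() locates a module without importing it (imports can be slow)
    if not any(importlib.util.find_spec(name) for name in FRAMEWORKS):
        problems.append("no ML framework found (tensorflow, torch, or jax)")
    return problems

for problem in check_environment():
    print("WARNING:", problem)
```

Running this in your Docker image's entrypoint makes misconfigured containers fail fast rather than mid-training.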
3. Storage Configuration
Fast and reliable storage is critical. NVMe SSDs are *strongly* recommended over traditional HDDs. Consider RAID configurations for redundancy and improved performance.
The following table details storage considerations for different workload sizes:
| Workload Size | Storage Type | RAID Level (Recommended) | Minimum Capacity |
|---|---|---|---|
| Small (experimentation, small datasets) | NVMe SSD | RAID 1 | 1 TB |
| Medium (training medium-sized models) | NVMe SSD | RAID 10 | 4 TB |
| Large (training large models, production inference) | NVMe SSD | RAID 10 | 8 TB+ |
Filesystems should be configured for optimal performance. XFS is generally preferred for large files and high throughput. Regular backups are essential; see Help:Backups.
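Note that RAID reduces usable capacity: the minimums in the table above are *usable* space, so the raw drives you order must be larger. A small calculator, as a sketch covering the levels mentioned here (the function name is illustrative):

```python
def usable_capacity_tb(drive_tb: float, count: int, level: str) -> float:
    """Usable capacity in TB for a few common RAID levels.

    RAID 0 stripes all drives (no redundancy), RAID 1 mirrors the whole
    set, RAID 10 mirrors pairs and stripes across them.
    """
    if level == "0":
        return drive_tb * count
    if level == "1":
        return drive_tb  # full mirror: capacity of a single drive
    if level == "10":
        if count < 4 or count % 2:
            raise ValueError("RAID 10 needs an even number of drives (>= 4)")
        return drive_tb * count / 2
    raise ValueError(f"unsupported RAID level: {level}")

# The 'Medium' row (4 TB usable on RAID 10) needs four 2 TB drives:
print(usable_capacity_tb(2, 4, "10"))  # -> 4.0
```

So an 8 TB+ 'Large' deployment on RAID 10 implies at least 16 TB of raw NVMe capacity.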
4. Networking Considerations
High-bandwidth, low-latency networking is crucial for distributed training and inference. 10 Gbps Ethernet is a baseline; 40 Gbps InfiniBand offers superior performance for multi-node setups.
The following table summarizes networking best practices:
| Aspect | Recommendation |
|---|---|
| Network interface | Dedicate a network interface to AI workloads. |
| Network segmentation | Isolate AI traffic from other network traffic. |
| Jumbo frames | Enable jumbo frames (9000 MTU) to reduce per-packet overhead. |
| RDMA (Remote Direct Memory Access) | Use RDMA over InfiniBand for low-latency communication. |
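The jumbo-frame recommendation can be quantified: each frame carries fixed per-packet overhead, so larger frames spend a higher fraction of wire bandwidth on payload. The sketch below assumes plain TCP over IPv4 over Ethernet with approximate overhead figures (38 bytes of Ethernet framing including preamble, header, FCS, and inter-frame gap; 40 bytes of IPv4 + TCP headers); it ignores TCP options and offload effects.

```python
# Approximate per-frame overhead for TCP/IPv4 over Ethernet.
ETH_FRAMING = 38     # preamble + header + FCS + inter-frame gap (bytes)
IP_TCP_HEADERS = 40  # IPv4 header (20) + TCP header without options (20)

def tcp_efficiency(mtu: int) -> float:
    """Fraction of wire bandwidth carrying TCP payload at a given MTU."""
    payload = mtu - IP_TCP_HEADERS
    return payload / (mtu + ETH_FRAMING)

print(f"MTU 1500: {tcp_efficiency(1500):.1%}")  # ~94.9%
print(f"MTU 9000: {tcp_efficiency(9000):.1%}")  # ~99.1%
```

The bandwidth gain is modest; the larger win from jumbo frames is usually the ~6x reduction in packets per second, which lowers per-packet CPU and interrupt load. Note that every device on the path must support the larger MTU, which is one more reason to segment AI traffic.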
5. Monitoring and Logging
Comprehensive monitoring and logging are essential for identifying and resolving performance bottlenecks and errors. Utilize tools such as:
- Prometheus & Grafana: For time-series data collection and visualization.
- ELK Stack (Elasticsearch, Logstash, Kibana): For centralized logging and analysis.
- System Monitoring Tools: `top`, `htop`, `iostat`, `vmstat` for real-time system performance monitoring.
- GPU Monitoring Tools: `nvidia-smi` for GPU utilization and health.
Regularly review logs and metrics to proactively identify and address potential issues. See Help:Monitoring for details on existing monitoring systems.
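For lightweight scripted GPU monitoring, `nvidia-smi` can emit machine-readable CSV. The sketch below parses the output of `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`; the parser function name and dictionary keys are illustrative choices, and the sample string stands in for live output.

```python
import csv
import io

def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse the CSV output of:

    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits
    """
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not rec:
            continue  # skip blank lines
        idx, util, used, total = (int(v) for v in rec)
        rows.append({"gpu": idx, "util_pct": util,
                     "mem_used_mib": used, "mem_total_mib": total})
    return rows

# Sample output for a two-GPU machine (values are made up):
sample = "0, 87, 40536, 81920\n1, 12, 1024, 81920\n"
for gpu in parse_gpu_stats(sample):
    print(gpu)
```

Feeding these records into Prometheus (e.g. via a textfile exporter) lets Grafana alert on idle or memory-saturated GPUs.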
6. Security Considerations
AI systems can be vulnerable to attacks. Implement robust security measures:
- Firewall: Restrict access to necessary ports only.
- Access Control: Implement strict access control policies.
- Data Encryption: Encrypt sensitive data at rest and in transit.
- Regular Security Audits: Conduct regular security audits to identify and address vulnerabilities. Refer to Help:Security.
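One simple audit step is verifying that only expected ports are listening. The sketch below parses the output of `ss -ltn` (listening TCP sockets, numeric ports) and flags anything outside an allowlist; the allowlist contents and function name are illustrative assumptions, not site policy.

```python
ALLOWED_PORTS = {22, 443, 8888}  # illustrative allowlist, not policy

def unexpected_ports(ss_output: str) -> set[int]:
    """Flag listening TCP ports absent from the allowlist.

    Expects the output of `ss -ltn`: a header line, then one line per
    socket with the local address:port in the fourth column.
    """
    found = set()
    for line in ss_output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "LISTEN":
            # the port follows the last colon of the local address
            found.add(int(fields[3].rsplit(":", 1)[1]))
    return found - ALLOWED_PORTS

sample = (
    "State  Recv-Q Send-Q Local Address:Port Peer Address:Port\n"
    "LISTEN 0      128    0.0.0.0:22         0.0.0.0:*\n"
    "LISTEN 0      128    *:6379             *:*\n"
)
print(unexpected_ports(sample))  # -> {6379}
```

Run from cron and wired into alerting, this catches services (here, an unprotected Redis) that slipped past the firewall rules.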