# GPU Management Tools

This article details various tools for managing Graphics Processing Units (GPUs) on our MediaWiki server infrastructure. Effective GPU management is crucial for tasks like video transcoding, machine learning, and accelerated rendering, all of which contribute to the performance and scalability of the wiki. This guide is intended for system administrators and engineers responsible for maintaining our server environment.

Introduction to GPU Management

Modern servers often leverage GPUs for workloads beyond traditional graphics processing. Managing these GPUs requires specialized tools to monitor their health, control their usage, and optimize performance. This document outlines some of the key tools we utilize, covering aspects from low-level control to high-level monitoring. Proper configuration and monitoring are vital to prevent performance bottlenecks and ensure the stability of our wiki services. See also Server Monitoring and Resource Allocation.

NVIDIA Management Tools

NVIDIA GPUs are the predominant type in our server fleet. As such, several NVIDIA-provided tools are essential for their management.

NVIDIA System Management Interface (nvidia-smi)

`nvidia-smi` is a command-line utility providing detailed information about NVIDIA GPUs. It's the primary tool for monitoring GPU utilization, temperature, memory usage, and power consumption. It also allows for limited control over GPU settings.

| Feature | Description |
| --- | --- |
| Monitoring | Reports real-time GPU utilization, temperature, memory usage, and power draw. |
| Control | Enables basic control, such as setting power limits, persistence mode, and clock limits. |
| Querying | Returns specific GPU attributes, such as driver version and GPU name. |
| Compatibility | Works with most NVIDIA GPUs; the set of supported features varies by GPU model and driver version. |

Example Usage: `nvidia-smi` (for a general overview), `nvidia-smi -l 1` (to refresh output every 1 second). Refer to NVIDIA Documentation for a complete list of options.
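For scripted monitoring, the human-readable table is awkward to parse, so `nvidia-smi` also offers a CSV query mode (`--query-gpu` with `--format=csv,noheader,nounits`). The following is a minimal sketch of how that output could be collected from Python. The function names and the choice of fields are illustrative assumptions, not part of any existing tooling on our servers. Values are kept as strings because `nvidia-smi` may report placeholders such as `[N/A]` for unsupported fields.

```python
import csv
import io
import subprocess

# Fields passed to --query-gpu (an illustrative selection);
# `nvidia-smi --help-query-gpu` lists everything the installed driver supports.
QUERY_FIELDS = [
    "index",
    "name",
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "temperature.gpu",
    "power.draw",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = (value.strip() for value in record)
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (requires the NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

On a host with the driver installed, `query_gpus()` returns a list such as `[{"index": "0", "name": ..., ...}]`; `parse_gpu_csv` can also be run on saved output for offline analysis.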

NVIDIA Data Center GPU Manager (DCGM)

DCGM is a more comprehensive tool suite designed for data center environments. It provides advanced monitoring, configuration, and diagnostics capabilities, including GPU grouping, policy-based alerts, health checks, and active diagnostics for fault detection. It is more complex to set up than `nvidia-smi` but offers significantly more functionality. See GPU Driver Installation for compatibility notes.

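| Component | Description |
| --- | --- |
| DCGM Exporter | Collects GPU metrics and exposes them in Prometheus format for monitoring. |
| Scheduler integration | DCGM does not schedule workloads itself; it provides health checks and per-job statistics that cluster schedulers and resource managers can consume when allocating GPUs. |
| DCGM API | Provides a programmatic interface for interacting with DCGM. |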
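To illustrate what the DCGM Exporter produces, the sketch below parses Prometheus text-format samples and keeps only the DCGM metrics. It assumes the exporter's usual metric naming (for example `DCGM_FI_DEV_GPU_TEMP` and `DCGM_FI_DEV_GPU_UTIL`); the function name and the `prefix` parameter are hypothetical, and the parser is deliberately simplified rather than a full implementation of the exposition format.

```python
import re

# One sample line looks like: metric_name{label="value",...} number [timestamp]
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text, prefix="DCGM_"):
    """Return (name, labels, value) tuples for metrics whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None or not match.group("name").startswith(prefix):
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

The input text would typically be fetched from the exporter's `/metrics` endpoint (port 9400 is the exporter's usual default; treat the host and port on our servers as an assumption to verify), for example with `urllib.request.urlopen(...).read().decode()`.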

NVIDIA Virtual GPU (vGPU) Software

For virtualized environments, NVIDIA vGPU software is crucial. It allows multiple virtual machines (VMs) to share a single physical GPU, improving resource utilization and reducing costs. This is essential for our Virtualization Infrastructure.

Open-Source GPU Management Tools

While NVIDIA provides robust tools, several open-source alternatives offer valuable features.

Prometheus and Grafana

Prometheus is a time-series database, and Grafana is a data visualization tool. Using the DCGM Exporter (mentioned above) or other exporters, we can collect GPU metrics with Prometheus and visualize them with Grafana. This provides a centralized monitoring solution for all server resources, including GPUs. See Monitoring System Setup for details.

| Tool | Purpose |
| --- | --- |
| Prometheus | Collects and stores time-series data, including GPU metrics. |
| Grafana | Visualizes data from Prometheus in customizable dashboards. |
| Alertmanager | Handles alerts based on predefined rules from Prometheus. |
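Once GPU metrics are stored in Prometheus, they can be read back through its HTTP API, which is also what Grafana does behind a dashboard panel. The sketch below issues an instant query against `/api/v1/query` and extracts the matching series. The server URL, the 85 °C threshold, and the function names are assumptions chosen for illustration, and the metric name assumes the DCGM Exporter is the data source.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the instant-query URL for a Prometheus server."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def extract_vector(payload):
    """Return (labels, value) pairs from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got %s" % data["resultType"])
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def hot_gpus(base_url, threshold_c=85.0):
    """List GPUs whose reported temperature exceeds threshold_c (assumed threshold)."""
    promql = "DCGM_FI_DEV_GPU_TEMP > %g" % threshold_c
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=10) as response:
        return extract_vector(json.load(response))
```

In production the same expression would more naturally live in a Prometheus alerting rule so that Alertmanager handles notification; the script form is mainly useful for ad hoc checks.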

CoolBits

Coolbits is not a standalone tool but an option of the NVIDIA X driver, set in the X server configuration (xorg.conf). It unlocks additional controls in `nvidia-settings`, such as manual fan control and clock offsets, that are disabled by default. Because these controls can push hardware outside its default operating range, enabling Coolbits requires careful consideration and testing. See X Server Configuration for more information.
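Because the option is easy to leave enabled after testing, it can help to audit configuration files for it. The following sketch scans the text of an xorg.conf file and reports any Coolbits values it finds, skipping commented-out lines. The function name is hypothetical and the pattern assumes the common `Option "Coolbits" "N"` form with a decimal value.

```python
import re

# Matches lines such as:    Option "Coolbits" "4"
# Lines starting with '#' are comments in xorg.conf and do not match.
_COOLBITS = re.compile(
    r'^\s*Option\s+"Coolbits"\s+"(\d+)"',
    re.IGNORECASE | re.MULTILINE,
)


def find_coolbits(xorg_conf_text):
    """Return every Coolbits value set in the given xorg.conf text."""
    return [int(value) for value in _COOLBITS.findall(xorg_conf_text)]
```

An empty list means the option is not set in that file; a non-empty list gives the configured values for follow-up review.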

GPU Monitoring Best Practices
