Data Ingestion: Server Configuration
This article details the server configuration required for efficient data ingestion into our MediaWiki 1.40 environment. Proper configuration is crucial for maintaining performance and data integrity. This guide is intended for newcomers to the server administration team. It covers hardware specifications, software prerequisites, and key configuration parameters.
Understanding the Data Ingestion Pipeline
Our data ingestion pipeline handles various data sources, including database dumps, API feeds, and direct file uploads. The process broadly consists of three stages: receiving the data, transforming it into a suitable format for MediaWiki, and loading it into the database. Each stage relies on specific server resources and software components. For general guidance on MediaWiki operation, see Help:Contents.
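The three stages can be pictured as a chain of small functions. The following is a minimal sketch, not the team's actual pipeline: it assumes the incoming data is one JSON object per line with hypothetical `title` and `body` fields, and it leaves the database write to a caller-supplied callback.

```python
import json
from typing import Callable, Iterable, Iterator, Tuple

Page = Tuple[str, str]  # (page title, wikitext)


def receive(lines: Iterable[str]) -> Iterator[dict]:
    """Stage 1: parse incoming JSON lines into raw records, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def transform(records: Iterable[dict]) -> Iterator[Page]:
    """Stage 2: map each raw record to a (title, wikitext) pair."""
    for record in records:
        # "title" and "body" are assumed field names, not a documented schema.
        yield record["title"].strip(), record["body"]


def load(pages: Iterable[Page], write: Callable[[str, str], None]) -> int:
    """Stage 3: hand each page to a writer callback and return the page count."""
    count = 0
    for title, text in pages:
        write(title, text)
        count += 1
    return count


def run_pipeline(lines: Iterable[str], write: Callable[[str, str], None]) -> int:
    """Chain the three stages; generators keep memory use flat for large inputs."""
    return load(transform(receive(lines)), write)
```

Because every stage is a generator, records stream through one at a time, so a large dump does not have to fit in memory. For a dry run, `run_pipeline(open_file, some_dict.__setitem__)` collects the pages in a dictionary instead of writing to a database.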
Hardware Specifications
The data ingestion server requires robust hardware to handle large datasets efficiently. The following table outlines the recommended specifications:
Component | Specification |
---|---|
CPU | Intel Xeon Gold 6248R (24 cores) or equivalent AMD EPYC processor |
RAM | 128 GB DDR4 ECC Registered RAM |
Storage (OS) | 500 GB NVMe SSD |
Storage (Data) | 4 TB RAID 10 SSD array |
Network Interface | 10 Gigabit Ethernet |
Power Supply | Redundant 800W Power Supplies |
These specifications are a baseline and may need adjustment based on the volume and velocity of ingested data. See Manual:Configuration settings for more details on server requirements.
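One quick way to judge whether the baseline fits a given data volume is to estimate how long a dataset takes to cross the network link. The sketch below is plain arithmetic under stated assumptions: decimal units (1 TB = 10^12 bytes), the 10 Gigabit Ethernet line rate as an upper bound, and a hypothetical `efficiency` factor for protocol overhead that you would have to measure yourself.

```python
TB = 10**12          # bytes, decimal terabyte
TEN_GBE = 10 * 10**9  # bits per second, nominal 10 Gigabit Ethernet line rate


def transfer_seconds(size_bytes: float, link_bits_per_s: float,
                     efficiency: float = 1.0) -> float:
    """Return the time needed to move size_bytes over a link.

    efficiency is the assumed fraction of line rate actually achieved (0 < e <= 1).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return size_bytes * 8 / (link_bits_per_s * efficiency)


# Filling the 4 TB data array at full line rate:
# 4e12 bytes * 8 bits / 1e10 bits per second = 3200 s, about 53 minutes.
print(transfer_seconds(4 * TB, TEN_GBE))
```

Real transfers are slower than this bound, so treat the result as the best case when deciding whether a faster link or a second interface is needed.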
Software Prerequisites
Several software packages are essential for the data ingestion process. These include:
- Operating System: CentOS Linux 7 or Ubuntu Server 20.04 LTS
- Database: MariaDB 10.5 or MySQL 8.0 (configured as a replica of the main wiki database)
- PHP: PHP 7.4 with required extensions (see below)
- Python 3: For data transformation scripts.
- SSH Access: Secure remote access for administration. Refer to Manual:Command-line access for more information.
PHP Extensions
The following PHP extensions are required:
Extension | Purpose |
---|---|
php-mysql | Connect to the MariaDB/MySQL database |
php-xml | Parse XML data from various sources |
php-json | Handle JSON data formats |
php-mbstring | Multibyte string support |
php-curl | Make HTTP requests for API data ingestion |
php-zip | Handle ZIP archives |
Ensure these extensions are enabled in your `php.ini` file. Manual:Configuration settings#PHP/Extensions provides detailed instructions.
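To confirm that the extensions are actually loaded, you can compare the output of `php -m` against the list above. This is a minimal sketch: the module names are an assumption about how the packages in the table show up in `php -m` (for example, the `php-mysql` package is assumed to provide the `mysqli` module), and `check_local_php` assumes a `php` binary on the `PATH`.

```python
import subprocess

# Assumed module names as printed by `php -m` for the packages in the table.
REQUIRED_MODULES = {"mysqli", "xml", "json", "mbstring", "curl", "zip"}


def missing_modules(php_m_output: str, required=REQUIRED_MODULES) -> set:
    """Return the required modules that do not appear in `php -m` output."""
    loaded = {
        line.strip().lower()
        for line in php_m_output.splitlines()
        # Skip blank lines and section headers such as "[PHP Modules]".
        if line.strip() and not line.startswith("[")
    }
    return set(required) - loaded


def check_local_php(php_binary: str = "php") -> set:
    """Run `php -m` locally and report which required modules are missing."""
    result = subprocess.run([php_binary, "-m"], capture_output=True,
                            text=True, check=True)
    return missing_modules(result.stdout)
```

An empty set from `check_local_php()` means every module in `REQUIRED_MODULES` is loaded for the command-line PHP; the web server may use a different `php.ini`, so check that configuration separately.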
Configuration Parameters
Several key configuration parameters influence data ingestion performance. These parameters should be carefully tuned based on your specific environment.
Database Configuration
The database replica used for ingestion must be appropriately configured to handle the load. Consider the following settings in your `my.cnf` file:
Parameter | Value | Description |
---|---|---|
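`innodb_buffer_pool_size` | 64G | Size of the InnoDB buffer pool. 64G is half of the recommended 128 GB of RAM; scale it with the RAM actually installed. |
`innodb_log_file_size` | 2G | Size of the InnoDB redo log files. Larger values reduce checkpoint flushing during heavy writes but lengthen crash recovery. |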
`max_allowed_packet` | 128M | Maximum size of a single packet or generated/received string. |
`read_buffer_size` | 2M | Buffer size used for sequential reads. |
Regular database maintenance, including index optimization, is crucial. Manual:Database maintenance offers guidance.
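The settings in the table above can be checked against an existing `my.cnf` with a short script. The following sketch uses Python's `configparser`, which handles simple option files but is not a full MySQL option-file parser: it does not follow `!include` or `!includedir` directives, and it assumes the settings live in a `[mysqld]` section. The function name and return shape are illustrative choices.

```python
import configparser

# Values taken from the table above.
RECOMMENDED = {
    "innodb_buffer_pool_size": "64G",
    "innodb_log_file_size": "2G",
    "max_allowed_packet": "128M",
    "read_buffer_size": "2M",
}


def mycnf_differences(mycnf_text: str, recommended=RECOMMENDED,
                      section: str = "mysqld") -> dict:
    """Return {setting: (actual, recommended)} for every setting that differs."""
    parser = configparser.ConfigParser(allow_no_value=True, strict=False,
                                       interpolation=None)
    parser.read_string(mycnf_text)
    actual = {}
    if parser.has_section(section):
        for key, value in parser.items(section):
            # Option files accept dashes and underscores interchangeably.
            actual[key.replace("-", "_")] = value
    return {
        name: (actual.get(name), wanted)
        for name, wanted in recommended.items()
        if (actual.get(name) or "").upper() != wanted.upper()
    }
```

A missing setting is reported with `None` as its actual value. The comparison is textual, so `2048M` would be flagged as different from `2G` even though the sizes are equal.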
PHP Configuration
Adjust PHP settings to optimize data processing:
- `memory_limit`: Increase this value (e.g., to 8G) to handle large datasets during transformation.
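- `max_execution_time`: Raise this limit so long-running import scripts are not killed mid-run. PHP's command-line interface defaults to 0 (no limit), so the setting mainly affects scripts started through the web server.
- `upload_max_filesize`: Raise this to the size of the largest file you expect to upload. `post_max_size` must be at least as large, or uploads will still be rejected.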
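PHP size settings use a shorthand with `K`, `M`, and `G` suffixes (powers of 1024), which makes values awkward to compare by eye. The sketch below converts the shorthand to bytes and checks one common mistake, an upload limit larger than the POST limit. It covers `post_max_size`, a related `php.ini` setting not listed above, and the function names are illustrative.

```python
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def php_size_to_bytes(value: str) -> int:
    """Convert PHP shorthand such as '8G' or '128M' to a number of bytes."""
    value = value.strip()
    if value and value[-1].upper() in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1].upper()]
    return int(value)  # a plain number is already in bytes


def upload_limit_problems(settings: dict) -> list:
    """Return human-readable problems found in the upload-related settings."""
    problems = []
    upload = php_size_to_bytes(settings["upload_max_filesize"])
    post = php_size_to_bytes(settings["post_max_size"])
    if post < upload:
        # An upload travels inside a POST body, so the POST limit applies first.
        problems.append("post_max_size is smaller than upload_max_filesize")
    return problems
```

For example, `upload_limit_problems({"upload_max_filesize": "2G", "post_max_size": "1G"})` reports the mismatch, while swapping the two values returns an empty list.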
Data Transformation Scripts
Python scripts are used to transform data into a format suitable for MediaWiki. These scripts should be optimized for performance and error handling. Implement robust logging to track the ingestion process. Manual:API may be relevant if ingesting data via the API.
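As an illustration of the error handling and logging described above, the sketch below transforms a batch of records, logs and skips the ones it cannot process, and reports totals at the end. The record fields (`title`, `summary`), the wikitext layout, and the logger name are assumptions made for the example, not the format our scripts use.

```python
import logging

logger = logging.getLogger("ingest.transform")  # hypothetical logger name


def to_wikitext(record: dict) -> tuple:
    """Turn one record into a (title, wikitext) pair, or raise on bad input."""
    title = str(record["title"]).strip()  # KeyError if the field is missing
    if not title:
        raise ValueError("empty title")
    text = "== Summary ==\n" + str(record.get("summary", "")) + "\n"
    return title, text


def transform_all(records) -> tuple:
    """Transform every record, logging and skipping the ones that fail."""
    pages, failed = [], 0
    for index, record in enumerate(records):
        try:
            pages.append(to_wikitext(record))
        except (KeyError, ValueError) as exc:
            failed += 1
            logger.warning("skipping record %d: %r", index, exc)
    logger.info("transformed %d records, skipped %d", len(pages), failed)
    return pages, failed
```

Catching only the expected exception types keeps genuine bugs visible: an unexpected error still stops the run instead of being silently counted as a skipped record. Call `logging.basicConfig(level=logging.INFO)` once at script start to see the messages.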
Security Considerations
- Restrict access to the data ingestion server to authorized personnel only.
- Use strong passwords and SSH keys for authentication.
- Implement firewalls to protect the server from unauthorized access.
- Regularly monitor server logs for suspicious activity.
- Ensure all software packages are up to date with the latest security patches.
Monitoring and Logging
Implement comprehensive monitoring and logging to track the data ingestion process. Monitor CPU usage, memory consumption, disk I/O, and network traffic. Centralized logging provides a valuable audit trail. Manual:Monitoring explains the tools available for monitoring your MediaWiki installation.
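A dedicated monitoring system is the right tool for this, but a small script shows the idea. The sketch below samples disk usage and load average with the Python standard library only and compares them against limits. It runs on Unix-like systems (`os.getloadavg` is not available on Windows), and the metric names and thresholds are illustrative assumptions.

```python
import os
import shutil


def resource_snapshot(path: str = "/") -> dict:
    """Sample disk usage for path and the 1-minute load average per CPU core."""
    usage = shutil.disk_usage(path)
    load_1m, _, _ = os.getloadavg()  # Unix only
    return {
        "disk_used_fraction": usage.used / usage.total if usage.total else 0.0,
        "load_per_core_1m": load_1m / (os.cpu_count() or 1),
    }


def alerts(snapshot: dict, limits: dict) -> list:
    """Return the names of the metrics whose value exceeds its limit."""
    return [name for name, limit in limits.items()
            if snapshot.get(name, 0) > limit]
```

A cron job could call `alerts(resource_snapshot("/"), {"disk_used_fraction": 0.9, "load_per_core_1m": 1.0})` and write any returned names to the central log.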
Intel-Based Server Configurations
Configuration | Specifications | Benchmark |
---|---|---|
Core i7-6700K/7700 Server | 64 GB DDR4, NVMe SSD 2 x 512 GB | CPU Benchmark: 8046 |
Core i7-8700 Server | 64 GB DDR4, NVMe SSD 2x1 TB | CPU Benchmark: 13124 |
Core i9-9900K Server | 128 GB DDR4, NVMe SSD 2 x 1 TB | CPU Benchmark: 49969 |
Core i9-13900 Server (64GB) | 64 GB RAM, 2x2 TB NVMe SSD | |
Core i9-13900 Server (128GB) | 128 GB RAM, 2x2 TB NVMe SSD | |
Core i5-13500 Server (64GB) | 64 GB RAM, 2x500 GB NVMe SSD | |
Core i5-13500 Server (128GB) | 128 GB RAM, 2x500 GB NVMe SSD | |
Core i5-13500 Workstation | 64 GB DDR5 RAM, 2 NVMe SSD, NVIDIA RTX 4000 | |
AMD-Based Server Configurations
Configuration | Specifications | Benchmark |
---|---|---|
Ryzen 5 3600 Server | 64 GB RAM, 2x480 GB NVMe | CPU Benchmark: 17849 |
Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2x1 TB NVMe | CPU Benchmark: 35224 |
Ryzen 9 5950X Server | 128 GB RAM, 2x4 TB NVMe | CPU Benchmark: 46045 |
Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2x2 TB NVMe | CPU Benchmark: 63561 |
EPYC 7502P Server (128GB/1TB) | 128 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (128GB/2TB) | 128 GB RAM, 2 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (128GB/4TB) | 128 GB RAM, 2x2 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (256GB/1TB) | 256 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
EPYC 7502P Server (256GB/4TB) | 256 GB RAM, 2x2 TB NVMe | CPU Benchmark: 48021 |
EPYC 9454P Server | 256 GB RAM, 2x2 TB NVMe | |
Note: All benchmark scores are approximate and may vary based on configuration.