Data ingestion

From Server rental store
Revision as of 10:22, 15 April 2025 by Admin (Automated server configuration article)
Data Ingestion: Server Configuration

This article details the server configuration required for efficient data ingestion into our MediaWiki 1.40 environment. Proper configuration is crucial for maintaining performance and data integrity. This guide is intended for newcomers to the server administration team. It covers hardware specifications, software prerequisites, and key configuration parameters.

Understanding the Data Ingestion Pipeline

Our data ingestion pipeline handles various data sources, including database dumps, API feeds, and direct file uploads. The process broadly consists of three stages: receiving the data, transforming it into a suitable format for MediaWiki, and loading it into the database. Each stage relies on specific server resources and software components. Special:MyLanguage/Help:Contents provides general guidance on MediaWiki operation.
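
The three stages described above can be sketched as a chain of Python generators. This is a minimal illustration, not the production pipeline: the newline-delimited JSON input, the `name` and `body` field names, and the `write_page` callback are all assumptions made for the example.

```python
import json
from typing import Callable, Iterable, Iterator


def receive(lines: Iterable[str]) -> Iterator[dict]:
    """Stage 1: parse incoming newline-delimited JSON, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def transform(records: Iterable[dict]) -> Iterator[dict]:
    """Stage 2: map each source record to a page title and a wikitext body."""
    for record in records:
        yield {
            # Hypothetical field names; real feeds will differ.
            "title": record["name"].strip().replace("_", " "),
            "text": record["body"],
        }


def load(pages: Iterable[dict], write_page: Callable[[str, str], None]) -> int:
    """Stage 3: pass each page to a writer callback and count what was loaded."""
    count = 0
    for page in pages:
        write_page(page["title"], page["text"])
        count += 1
    return count


if __name__ == "__main__":
    feed = ['{"name": "Server_rack", "body": "A rack holds servers."}', ""]
    stored = {}  # stands in for the database in this sketch
    loaded = load(transform(receive(feed)), stored.__setitem__)
    print(loaded, stored)
```

Because each stage is a generator, records stream through one at a time, which keeps memory use flat even for large dumps.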

Hardware Specifications

The data ingestion server requires robust hardware to handle large datasets efficiently. The following table outlines the recommended specifications:

| Component | Specification |
|---|---|
| CPU | Intel Xeon Gold 6248R (24 cores) or equivalent AMD EPYC processor |
| RAM | 128 GB DDR4 ECC Registered RAM |
| Storage (OS) | 500 GB NVMe SSD |
| Storage (Data) | 4 TB RAID 10 SSD array |
| Network Interface | 10 Gigabit Ethernet |
| Power Supply | Redundant 800W Power Supplies |

These specifications are a baseline and may need adjustment based on the volume and velocity of ingested data. See Special:MyLanguage/Manual:Configuration settings for more details on server requirements.
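
A quick preflight check can compare a host against this baseline. The sketch below is illustrative only: the 10% tolerance (to allow for memory reserved by firmware and filesystem overhead), the key names, and the use of logical CPU count as a stand-in for cores are assumptions, and `measure()` relies on `os.sysconf`, so it is expected to work on Linux only.

```python
import os
import shutil

# Baseline from the table above; key names are chosen for this example.
BASELINE = {"logical_cpus": 24, "ram_gb": 128, "data_tb": 4}


def measure(data_path: str = "/") -> dict:
    """Read CPU count, physical RAM, and data-volume size from the host (Linux)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return {
        "logical_cpus": os.cpu_count() or 0,  # logical CPUs, not physical cores
        "ram_gb": page_size * page_count / 1e9,
        "data_tb": shutil.disk_usage(data_path).total / 1e12,
    }


def shortfalls(measured: dict, baseline: dict = BASELINE, tolerance: float = 0.1) -> list:
    """Return one message per resource that falls below the baseline minus tolerance."""
    missing = []
    for key, required in baseline.items():
        have = measured.get(key, 0)
        if have < required * (1 - tolerance):
            missing.append(f"{key}: have {have:g}, want {required:g}")
    return missing


if __name__ == "__main__":
    for problem in shortfalls(measure()):
        print("below baseline:", problem)
```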

Software Prerequisites

Several software packages are essential for the data ingestion process. These include:

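  • Operating System: CentOS Linux 7 or Ubuntu Server 20.04 LTS (CentOS Linux 7 reached end of life in June 2024, so prefer a supported release for new deployments)
  • Database: MariaDB 10.5 or MySQL 8.0 (configured as a replica of the main wiki database)
  • PHP: PHP 7.4.3 or later, the minimum that MediaWiki 1.40 supports, with the required extensions (see below)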
  • Python 3: For data transformation scripts.
  • SSH Access: Secure remote access for administration. Refer to Special:MyLanguage/Manual:Command-line access for more information.

PHP Extensions

The following PHP extensions are required:

| Extension | Purpose |
|---|---|
| php-mysql | Connect to the MariaDB/MySQL database |
| php-xml | Parse XML data from various sources |
| php-json | Handle JSON data formats |
| php-mbstring | Multibyte string support |
| php-curl | Make HTTP requests for API data ingestion |
| php-zip | Handle ZIP archives |

Ensure these extensions are enabled in your `php.ini` file. Special:MyLanguage/Manual:Configuration settings#PHP/Extensions provides detailed instructions.
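
One way to confirm the extensions are loaded is to inspect the output of `php -m`. The following Python sketch assumes the packages above register the module names `mysqli`, `xml`, `json`, `mbstring`, `curl`, and `zip`; that mapping is an assumption and should be checked against your distribution's packaging.

```python
import subprocess

# Module names as printed by `php -m`. Assumed mapping: the php-mysql
# package is taken to provide the mysqli module.
REQUIRED_MODULES = {"mysqli", "xml", "json", "mbstring", "curl", "zip"}


def missing_modules(php_m_output: str, required=REQUIRED_MODULES) -> list:
    """Return the required modules that do not appear in `php -m` output."""
    loaded = {
        line.strip().lower()
        for line in php_m_output.splitlines()
        # Skip blank lines and section headers such as "[PHP Modules]".
        if line.strip() and not line.startswith("[")
    }
    return sorted(required - loaded)


if __name__ == "__main__":
    result = subprocess.run(
        ["php", "-m"], capture_output=True, text=True, check=True
    )
    missing = missing_modules(result.stdout)
    print("missing:", ", ".join(missing) if missing else "none")
```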

Configuration Parameters

Several key configuration parameters influence data ingestion performance. These parameters should be carefully tuned based on your specific environment.

Database Configuration

The database replica used for ingestion must be appropriately configured to handle the load. Consider the following settings in your `my.cnf` file:

| Parameter | Value | Description |
|---|---|---|
| `innodb_buffer_pool_size` | 64G | Size of the InnoDB buffer pool. Adjust based on available RAM. |
| `innodb_log_file_size` | 2G | Size of each InnoDB redo log file. Larger values reduce checkpoint flushing and improve write throughput, at the cost of longer crash recovery. |
| `max_allowed_packet` | 128M | Maximum size of a single packet or generated/received string. |
| `read_buffer_size` | 2M | Buffer size used for sequential reads. |

Regular database maintenance, including index optimization, is crucial. Special:MyLanguage/Manual:Database maintenance offers guidance.
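
To keep the buffer pool in proportion to installed memory, the `[mysqld]` section can be generated rather than edited by hand. This sketch assumes a rule of thumb of half the RAM for the buffer pool, which matches the 64G value on a 128 GB host; the fraction and the function name are illustrative, not a site standard.

```python
def render_mysqld_section(ram_gib: int, pool_fraction: float = 0.5) -> str:
    """Render a [mysqld] fragment with the buffer pool sized from installed RAM."""
    # Assumption: the buffer pool takes a fixed fraction of RAM, at least 1 GiB.
    pool_gib = max(1, int(ram_gib * pool_fraction))
    settings = {
        "innodb_buffer_pool_size": f"{pool_gib}G",
        "innodb_log_file_size": "2G",
        "max_allowed_packet": "128M",
        "read_buffer_size": "2M",
    }
    lines = ["[mysqld]"] + [f"{key} = {value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Review the output before merging it into my.cnf.
    print(render_mysqld_section(128), end="")
```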

PHP Configuration

Adjust PHP settings to optimize data processing:

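  • `memory_limit`: Increase this value (e.g., to 8G) to handle large datasets during transformation.
  • `max_execution_time`: Extend the maximum execution time to prevent web-triggered scripts from timing out. Scripts run from the command line default to 0 (no limit).
  • `upload_max_filesize`: Raise this for large file uploads, and keep `post_max_size` at least as large, because it caps the whole request body.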
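
PHP size settings use shorthand suffixes (`K`, `M`, `G`), so comparing them means converting to bytes first. The sketch below checks two relationships between the limits. `post_max_size` is a standard PHP directive that is not listed above, and the function names are hypothetical.

```python
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def php_size_to_bytes(value: str) -> int:
    """Convert PHP shorthand such as '128M' or '8G' to bytes ('-1' stays -1)."""
    value = value.strip()
    suffix = value[-1:].upper()
    if suffix in _UNITS:
        return int(value[:-1]) * _UNITS[suffix]
    return int(value)


def check_limits(settings: dict) -> list:
    """Return warnings for size limits that contradict each other."""
    problems = []
    upload = php_size_to_bytes(settings["upload_max_filesize"])
    post = php_size_to_bytes(settings["post_max_size"])
    memory = php_size_to_bytes(settings["memory_limit"])
    if post < upload:
        problems.append("post_max_size is smaller than upload_max_filesize")
    # memory_limit = -1 means unlimited, so only compare positive values.
    if memory != -1 and memory < post:
        problems.append("memory_limit is smaller than post_max_size")
    return problems


if __name__ == "__main__":
    example = {"memory_limit": "8G", "post_max_size": "1G", "upload_max_filesize": "2G"}
    print(check_limits(example))
```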

Data Transformation Scripts

Python scripts are used to transform data into a format suitable for MediaWiki. These scripts should be optimized for performance and error handling. Implement robust logging to track the ingestion process. Special:MyLanguage/Manual:API may be relevant if ingesting data via the API.
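
A transformation script with logging and per-record error handling might look like the sketch below. It builds parameter sets for the MediaWiki `action=edit` API module. The input field names (`title`, `text`) and the edit summary are assumptions, and the caller would still need to obtain a CSRF token and send the request.

```python
import logging

logger = logging.getLogger("ingest.transform")


def to_edit_request(record: dict) -> dict:
    """Build action=edit parameters for one record (token is added by the caller)."""
    title = str(record["title"]).strip()
    if not title:
        raise ValueError("empty title")
    return {
        "action": "edit",
        "format": "json",
        "title": title,
        "text": str(record["text"]),
        "summary": "Automated ingestion",  # hypothetical edit summary
    }


def transform_all(records) -> tuple:
    """Convert every record, logging and skipping the ones that are malformed."""
    converted, failed = [], 0
    for index, record in enumerate(records):
        try:
            converted.append(to_edit_request(record))
        except (KeyError, ValueError, TypeError) as exc:
            failed += 1
            logger.warning("skipping record %d: %r", index, exc)
    logger.info("transformed %d records, skipped %d", len(converted), failed)
    return converted, failed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    sample = [{"title": " Rack ", "text": "body"}, {"text": "no title"}]
    print(transform_all(sample))
```

Skipping bad records instead of aborting keeps a single malformed entry from blocking a large batch, while the warning log preserves enough detail to reprocess it later.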

Security Considerations

  • Restrict access to the data ingestion server to authorized personnel only.
  • Use strong passwords and SSH keys for authentication.
  • Implement firewalls to protect the server from unauthorized access.
  • Regularly monitor server logs for suspicious activity.
  • Ensure all software packages are up to date with the latest security patches.

Monitoring and Logging

Implement comprehensive monitoring and logging to track the data ingestion process. Monitor CPU usage, memory consumption, disk I/O, and network traffic. Centralized logging provides a valuable audit trail. Special:MyLanguage/Manual:Monitoring explains the tools available for monitoring your MediaWiki installation.
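
As a starting point, a small Python sampler can emit metrics as JSON lines for a centralized log collector. This sketch covers only load average and disk usage; memory and network counters are left out, and the field names are illustrative. It uses only the standard library and is not a substitute for a full monitoring stack.

```python
import json
import os
import shutil
import time


def sample_metrics(path: str = "/") -> dict:
    """Collect a timestamp, 1-minute load average, and disk usage for `path`."""
    usage = shutil.disk_usage(path)
    try:
        load_1m = os.getloadavg()[0]
    except (AttributeError, OSError):
        load_1m = None  # load average is not available on this platform
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return {
        "timestamp": int(time.time()),
        "load_1m": load_1m,
        "disk_used_fraction": round(used_fraction, 4),
    }


def to_log_line(metrics: dict) -> str:
    """Serialize one sample as a single JSON line with stable key order."""
    return json.dumps(metrics, sort_keys=True)


if __name__ == "__main__":
    print(to_log_line(sample_metrics()))
```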


Intel-Based Server Configurations

| Configuration | Specifications | Benchmark |
|---|---|---|
| Core i7-6700K/7700 Server | 64 GB DDR4, NVMe SSD 2 x 512 GB | CPU Benchmark: 8046 |
| Core i7-8700 Server | 64 GB DDR4, NVMe SSD 2x1 TB | CPU Benchmark: 13124 |
| Core i9-9900K Server | 128 GB DDR4, NVMe SSD 2 x 1 TB | CPU Benchmark: 49969 |
| Core i9-13900 Server (64GB) | 64 GB RAM, 2x2 TB NVMe SSD | |
| Core i9-13900 Server (128GB) | 128 GB RAM, 2x2 TB NVMe SSD | |
| Core i5-13500 Server (64GB) | 64 GB RAM, 2x500 GB NVMe SSD | |
| Core i5-13500 Server (128GB) | 128 GB RAM, 2x500 GB NVMe SSD | |
| Core i5-13500 Workstation | 64 GB DDR5 RAM, 2 NVMe SSD, NVIDIA RTX 4000 | |

AMD-Based Server Configurations

| Configuration | Specifications | Benchmark |
|---|---|---|
| Ryzen 5 3600 Server | 64 GB RAM, 2x480 GB NVMe | CPU Benchmark: 17849 |
| Ryzen 7 7700 Server | 64 GB DDR5 RAM, 2x1 TB NVMe | CPU Benchmark: 35224 |
| Ryzen 9 5950X Server | 128 GB RAM, 2x4 TB NVMe | CPU Benchmark: 46045 |
| Ryzen 9 7950X Server | 128 GB DDR5 ECC, 2x2 TB NVMe | CPU Benchmark: 63561 |
| EPYC 7502P Server (128GB/1TB) | 128 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (128GB/2TB) | 128 GB RAM, 2 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (128GB/4TB) | 128 GB RAM, 2x2 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (256GB/1TB) | 256 GB RAM, 1 TB NVMe | CPU Benchmark: 48021 |
| EPYC 7502P Server (256GB/4TB) | 256 GB RAM, 2x2 TB NVMe | CPU Benchmark: 48021 |
| EPYC 9454P Server | 256 GB RAM, 2x2 TB NVMe | |

⚠️ *Note: All benchmark scores are approximate and may vary based on configuration. Server availability subject to stock.* ⚠️