AutoHarvester Documentation
This document details the server configuration for the AutoHarvester, a critical component of our data ingestion pipeline. It is intended for system administrators, server engineers, and anyone responsible for maintaining the AutoHarvester infrastructure. This guide covers hardware specifications, software dependencies, and configuration parameters.
Overview
The AutoHarvester automatically collects data from various external sources and imports it into our MediaWiki instance. It runs on a dedicated server to minimize impact on the primary wiki server and to keep data flowing consistently. It combines scheduled tasks, web scraping, and API integrations. Understanding the server configuration is crucial for troubleshooting issues, scaling the system, and ensuring data integrity. See Special:MyPreferences for personalization options related to viewing this documentation.
Hardware Specifications
The AutoHarvester server requires specific hardware resources to operate efficiently. These specifications are minimum recommendations and may need to be adjusted based on the volume of data being harvested.
Component | Specification |
---|---|
CPU | Intel Xeon E3-1270 v5 (or equivalent) |
RAM | 16 GB DDR4 ECC |
Storage | 1 TB SSD (RAID 1 recommended) |
Network Interface | 1 Gbps Ethernet |
Power Supply | 500W 80+ Gold |
With these specifications, the AutoHarvester should handle its processing load without performance degradation. For hardware maintenance, see Help:Contents or consult the IT department.
Software Dependencies
The AutoHarvester relies on several software components to function correctly. These include the operating system, programming languages, databases, and various libraries.
Software | Version |
---|---|
Operating System | Ubuntu Server 22.04 LTS |
Python | 3.10 |
MySQL | 8.0 |
Beautiful Soup | 4.11 |
Requests | 2.28 |
Cron | Default Ubuntu implementation |
Regular updates to these components are essential for security and stability. See Special:Search for information on past updates. Ensure compatibility between versions to avoid conflicts. Consult the Help:FAQ for frequently asked questions.
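A version check along these lines can help confirm that the installed Python libraries match the table above. This is a minimal sketch, not part of the AutoHarvester itself: the `EXPECTED` mapping and the `check_versions` helper are hypothetical names, and the comparison only looks at the major and minor version components. `beautifulsoup4` is the distribution name under which Beautiful Soup 4 is published.

```python
from importlib.metadata import PackageNotFoundError, version

# Documented versions from the table above (major.minor only).
# The mapping itself is illustrative, not an AutoHarvester artifact.
EXPECTED = {"beautifulsoup4": "4.11", "requests": "2.28"}


def major_minor(full_version):
    """Reduce a version string such as '2.28.1' to '2.28'."""
    return ".".join(full_version.split(".")[:2])


def check_versions(expected, lookup=version):
    """Return a list of human-readable mismatches; empty means all match.

    `lookup` defaults to importlib.metadata.version and can be replaced
    with a stub when testing.
    """
    problems = []
    for package, wanted in expected.items():
        try:
            found = lookup(package)
        except PackageNotFoundError:
            problems.append(f"{package}: not installed")
            continue
        if major_minor(found) != wanted:
            problems.append(f"{package}: found {found}, documented {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_versions(EXPECTED):
        print(problem)
```

Running the script on the AutoHarvester server prints one line per mismatch and nothing when the installed versions agree with the table.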
Configuration Parameters
The AutoHarvester’s behavior is controlled by several configuration parameters. These parameters are stored in a configuration file located at `/etc/autoharvester/config.ini`.
Parameter | Description | Default Value |
---|---|---|
`wiki_url` | The URL of the MediaWiki instance. | `https://www.example.com/wiki/` |
`wiki_user` | The username for the MediaWiki account used for harvesting. | `AutoHarvester` |
`wiki_password` | The password for the MediaWiki account. The default is a placeholder and must be replaced before deployment. | `securepassword` |
`harvest_interval` | The interval (in minutes) between harvesting runs. | `60` |
`data_sources` | A list of data sources to harvest. Defined as a JSON array. | `[{"url": "https://example.com/data1", "type": "web"}, {"api_endpoint": "https://api.example.com", "type": "api"}]` |
Modifying these parameters requires careful consideration and testing. Incorrect configuration can lead to data ingestion errors or security vulnerabilities. Always back up the configuration file before making changes. Refer to Special:Random for a random page to test your configuration.
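The parameters above could be read and sanity-checked as in the following sketch, which uses Python's standard `configparser` and `json` modules. The `[autoharvester]` section name and the `load_settings` helper are assumptions made for illustration; the source does not state how `config.ini` is divided into sections or how the AutoHarvester parses it.

```python
import configparser
import json

# Sample contents mirroring the defaults in the table above.
# The [autoharvester] section name is an assumption.
SAMPLE_CONFIG = """
[autoharvester]
wiki_url = https://www.example.com/wiki/
wiki_user = AutoHarvester
wiki_password = securepassword
harvest_interval = 60
data_sources = [{"url": "https://example.com/data1", "type": "web"}, {"api_endpoint": "https://api.example.com", "type": "api"}]
"""


def load_settings(text, section="autoharvester"):
    """Parse INI text and validate the fields that have a clear type."""
    # Interpolation is disabled so a '%' in a password or URL is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    cfg = parser[section]

    interval = cfg.getint("harvest_interval")
    if interval <= 0:
        raise ValueError("harvest_interval must be a positive number of minutes")

    sources = json.loads(cfg["data_sources"])
    if not isinstance(sources, list):
        raise ValueError("data_sources must be a JSON array")

    return {
        "wiki_url": cfg["wiki_url"],
        "wiki_user": cfg["wiki_user"],
        "harvest_interval": interval,
        "data_sources": sources,
    }


settings = load_settings(SAMPLE_CONFIG)
print(settings["harvest_interval"], len(settings["data_sources"]))
```

To check the real file, the same function can be given the text of `/etc/autoharvester/config.ini`. Running it after each edit catches malformed JSON or a non-numeric interval before the next harvesting run does.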
Networking Configuration
The AutoHarvester server needs outbound internet access to retrieve data from its sources, and it must be able to reach the MediaWiki server to upload the harvested data.
- Firewall Rules: Ensure that the firewall allows outgoing connections on ports 80 (HTTP) and 443 (HTTPS) to the data sources for web scraping, and outgoing connections to the MediaWiki server on port 443 (HTTPS) for API access.
- DNS Resolution: Verify that the AutoHarvester server can resolve the domain names of the data sources and the MediaWiki server.
- Proxy Settings: If a proxy server is required to access the internet, configure the proxy settings in the `config.ini` file. See Help:Linking and referencing for more information about linking to external resources.
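The DNS and connectivity checks in the list above can be scripted. The sketch below derives the hosts to test from `wiki_url` and `data_sources`, then attempts a DNS lookup and a TCP connection for each. The helper names are hypothetical, and the check connects directly, so it does not apply on servers that reach the internet only through a proxy.

```python
import socket
from urllib.parse import urlsplit


def hosts_to_check(wiki_url, data_sources):
    """Collect unique (hostname, port) pairs from the configured URLs."""
    urls = [wiki_url]
    for source in data_sources:
        # The sample configuration uses either "url" or "api_endpoint".
        urls.append(source.get("url") or source.get("api_endpoint"))

    hosts = []
    for url in urls:
        if not url:
            continue
        parts = urlsplit(url)
        port = parts.port or (443 if parts.scheme == "https" else 80)
        if parts.hostname and (parts.hostname, port) not in hosts:
            hosts.append((parts.hostname, port))
    return hosts


def check_host(host, port, timeout=5.0):
    """Return 'ok' or a short description of what failed."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror as exc:
        return f"DNS failure: {exc}"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return f"connect failure: {exc}"


if __name__ == "__main__":
    sources = [
        {"url": "https://example.com/data1", "type": "web"},
        {"api_endpoint": "https://api.example.com", "type": "api"},
    ]
    for host, port in hosts_to_check("https://www.example.com/wiki/", sources):
        print(host, port, check_host(host, port))
```

A DNS failure points at resolver configuration, while a connect failure after successful resolution usually points at firewall rules or routing.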
Security Considerations
The AutoHarvester holds MediaWiki credentials and ingests data from external sources, so the following safeguards apply.
- Account Security: The MediaWiki account used for harvesting should have limited permissions. Avoid using an administrator account.
- Data Validation: Implement data validation checks to prevent the ingestion of malicious or corrupted data.
- Regular Audits: Conduct regular security audits to identify and address potential vulnerabilities.
- SSL/TLS: Always use HTTPS for all communication with data sources and the MediaWiki server. Review Special:Statistics to see usage trends and potential security concerns.
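Two of the points above, HTTPS-only sources and data validation, lend themselves to simple automated checks. The sketch below is illustrative only: the record shape (`title` and `text` fields) and the size limit are assumptions, since the source does not describe the format of harvested data.

```python
from urllib.parse import urlsplit


def insecure_sources(data_sources):
    """Return the URLs of sources that do not use HTTPS.

    Sources with no URL at all are reported as an empty string.
    """
    flagged = []
    for source in data_sources:
        url = source.get("url") or source.get("api_endpoint") or ""
        if urlsplit(url).scheme != "https":
            flagged.append(url)
    return flagged


def validate_record(record, max_text_bytes=1_000_000):
    """Check a harvested record before import; return a list of problems.

    The 'title' and 'text' fields and the 1 MB limit are hypothetical.
    """
    problems = []
    title = record.get("title")
    text = record.get("text")

    if not isinstance(title, str) or not title.strip():
        problems.append("title is missing or empty")
    elif any(ord(ch) < 32 for ch in title):
        problems.append("title contains control characters")

    if not isinstance(text, str):
        problems.append("text is missing or not a string")
    elif len(text.encode("utf-8")) > max_text_bytes:
        problems.append("text exceeds size limit")

    return problems


if __name__ == "__main__":
    print(insecure_sources([{"url": "http://example.com/feed", "type": "web"}]))
    print(validate_record({"title": "Sample page", "text": "Body text"}))
```

Records that produce a non-empty problem list would be skipped and logged rather than imported.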
Troubleshooting
If the AutoHarvester is not functioning correctly, check the following:
- Log Files: Examine the log files located in `/var/log/autoharvester/` for error messages.
- Configuration File: Verify that the `config.ini` file is correctly configured.
- Network Connectivity: Ensure that the server can connect to the data sources and the MediaWiki server.
- Resource Usage: Monitor CPU, memory, and disk usage to identify potential bottlenecks.
- Cron Jobs: Confirm that the scheduled tasks are running as expected. Check Help:Table of contents for a comprehensive overview of the wiki's structure.
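For the log file check, a short script can pull out the lines that matter. The sketch below assumes the logs are plain-text files matching `*.log` and that error lines contain the level name `ERROR` or `CRITICAL`. Neither detail is stated in the source, so adjust the pattern to the actual log format.

```python
import re
from pathlib import Path

# Assumed log convention: the severity appears as a standalone word.
LEVEL_PATTERN = re.compile(r"\b(ERROR|CRITICAL)\b")


def error_lines(lines):
    """Return the lines that mention an error-level severity."""
    return [line.rstrip("\n") for line in lines if LEVEL_PATTERN.search(line)]


def scan_log_dir(log_dir="/var/log/autoharvester/", pattern="*.log"):
    """Map each log file name to its error lines, skipping clean files."""
    findings = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        with path.open(encoding="utf-8", errors="replace") as handle:
            hits = error_lines(handle)
        if hits:
            findings[path.name] = hits
    return findings


if __name__ == "__main__":
    for name, hits in scan_log_dir().items():
        print(f"{name}: {len(hits)} error line(s)")
        for hit in hits[-5:]:  # show only the most recent few
            print("   ", hit)
```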
Future Enhancements
Planned enhancements for the AutoHarvester include:
- Support for more data source types.
- Improved data validation and error handling.
- A web-based interface for managing data sources and monitoring harvesting runs.
- Automated scaling based on data volume.
- Integration with our monitoring system for proactive alerts.