Pricing
$10.00/month + usage
Wikipedia Data Scraper Pro
An automated crawler that extracts textual content and metadata from Wikipedia pages for building knowledge bases.
Wikipedia Scraper
Extract structured data from Wikipedia at any scale: articles, categories, sections, links, and multilingual translations, all without managing infrastructure.
What Does This Actor Do?
Wikipedia Scraper fetches public Wikipedia data through the official MediaWiki API. It supports three modes:
| Mode | Use Case |
|---|---|
| Article Pages | Scrape one or many articles by title |
| Category Crawl | Collect every article under a category (and its subcategories) |
| Translation Comparison | Fetch the same article across multiple language editions |
Every result is pushed to the Apify Dataset as a structured record you can download as JSON, CSV, Excel, or XML.
Output Fields
Each dataset item contains:
| Field | Type | Description |
|---|---|---|
| `title` | string | Wikipedia article title |
| `language` | string | Language code (`en`, `de`, `fr`, …) |
| `pageId` | integer | Wikipedia internal page ID |
| `url` | string | Full URL to the article |
| `scrapedAt` | ISO date | Timestamp of extraction |
| `summary` | string | Full lead section text |
| `summaryPreview` | string | First 200 characters of summary |
| `sections` | array | Nested section tree (title + text + subsections) |
| `links` | object | Outbound wiki links `{ title → url }` |
| `categories` | object | Article categories `{ name → url }` |
| `translations` | object | Other language editions `{ lang → { title, url } }` |
| `numSections` | integer | Count of top-level sections |
| `numLinks` | integer | Count of outbound links returned |
| `numCategories` | integer | Count of categories returned |
| `numTranslations` | integer | Count of available language editions |
| `status` | string | `ok`, `not_found`, `network_error`, or `error` |
When Translation Comparison mode is used, items also include a `comparisonBaseTitle` field identifying the base English article.
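Because every item carries a `status` field, a consumer will usually want to separate successful records from failures before further processing. The following is a minimal sketch of that step. The helper name `group_by_status` and the sample records are illustrative and hand-written, not real actor output, and treating a missing `status` as `error` is an assumption.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_by_status(items: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group dataset items by their `status` field."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for item in items:
        # Assumption: an item without a status is treated as "error"
        # so that nothing is silently dropped.
        groups[item.get("status", "error")].append(item)
    return dict(groups)


# Hand-written sample records (hypothetical, not real actor output).
sample = [
    {"title": "Alan Turing", "language": "en", "status": "ok", "numSections": 12},
    {"title": "Nonexistent page", "language": "en", "status": "not_found"},
    {"title": "Machine learning", "language": "en", "status": "ok", "numSections": 9},
]

groups = group_by_status(sample)
ok_titles = [item["title"] for item in groups.get("ok", [])]
print(ok_titles)
```

Items in the `not_found` or `network_error` groups can then be logged or retried separately.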
Input Configuration
Mode: Article Pages
```json
{
  "mode": "page",
  "topics": ["Python (programming language)", "Alan Turing", "Machine learning"],
  "language": "en",
  "includeSections": true,
  "includeLinks": true,
  "includeCategories": true,
  "includeTranslations": false,
  "includeFullText": false
}
```
Mode: Category Crawl
```json
{
  "mode": "category",
  "categoryTitle": "Category:Machine learning",
  "language": "en",
  "categoryMaxDepth": 1,
  "maxPages": 50,
  "includeSections": true,
  "includeLinks": false
}
```
Mode: Translation Comparison
```json
{
  "mode": "translations",
  "comparisonTitle": "Artificial intelligence",
  "translationLanguages": ["en", "de", "fr", "ja", "ar", "es", "zh", "ru"]
}
```
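Each mode expects a different set of keys, so it can help to check an input object locally before starting a run. The sketch below infers the required keys from the three example inputs above. The `REQUIRED_KEYS` mapping and the `check_input` helper are hypothetical; whether the actor itself enforces the same rules is not stated in this README.

```python
from typing import Any, Dict, List

# Required keys per mode, inferred from the example inputs (an assumption,
# not the actor's published input schema).
REQUIRED_KEYS: Dict[str, List[str]] = {
    "page": ["topics"],
    "category": ["categoryTitle"],
    "translations": ["comparisonTitle", "translationLanguages"],
}


def check_input(run_input: Dict[str, Any]) -> Dict[str, Any]:
    """Return run_input unchanged, or raise ValueError if it looks incomplete."""
    mode = run_input.get("mode")
    if mode not in REQUIRED_KEYS:
        raise ValueError(f"unknown mode: {mode!r}")
    # A key counts as missing if it is absent or empty (e.g. an empty list).
    missing = [key for key in REQUIRED_KEYS[mode] if not run_input.get(key)]
    if missing:
        raise ValueError(f"mode {mode!r} is missing: {', '.join(missing)}")
    return run_input


valid = check_input({"mode": "category", "categoryTitle": "Category:Machine learning"})
print(valid["mode"])
```

Running the check before calling the actor surfaces a missing `categoryTitle` or an empty `topics` list without spending a run.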
Usage Examples
Using the Apify API (Python)
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("YOUR_ACTOR_ID").call(
    run_input={
        "mode": "page",
        "topics": ["Deep learning", "Neural network", "Transformer (machine learning model)"],
        "language": "en",
        "includeSections": True,
        "includeLinks": True,
    }
)

# Iterate over the records in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], "-", item["url"])
    print("  Summary:", item["summaryPreview"])
    print("  Sections:", item["numSections"])
```
Using the Apify API (JavaScript/Node.js)
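```javascript
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// CommonJS modules have no top-level await, so the calls run inside an async function.
async function main() {
    const run = await client.actor('YOUR_ACTOR_ID').call({
        mode: 'category',
        categoryTitle: 'Category:Physics',
        categoryMaxDepth: 1,
        maxPages: 30,
    });

    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => console.log(item.title, item.url));
}

main().catch(console.error);
```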
Using the Apify CLI
```bash
# Install the CLI
npm install -g apify-cli

# Run locally (requires .actor/ directory)
apify run --input='{"mode":"page","topics":["Quantum computing"]}'

# Deploy to the Apify platform
apify push
```
Sections Structure Example
When `includeSections` is `true`, each article item contains a `sections` array:
```json
{
  "sections": [
    {
      "level": 1,
      "title": "History",
      "text": "Python was conceived in the late 1980s by Guido van Rossum...",
      "subsections": [
        {
          "level": 2,
          "title": "Early development",
          "text": "Python 0.9.0 was published to alt.sources in February 1991...",
          "subsections": []
        }
      ]
    },
    {
      "level": 1,
      "title": "Design philosophy",
      "text": "Python is a multi-paradigm programming language...",
      "subsections": []
    }
  ]
}
```
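Since `sections` is a tree, many downstream uses (search indexing, chunking for a knowledge base) need it flattened first. The following sketch walks the tree depth-first and yields each section together with its heading path. The function name `walk_sections` is illustrative, and the sample data mirrors the example above with shortened text.

```python
from typing import Any, Dict, Iterator, List, Tuple

Section = Dict[str, Any]


def walk_sections(
    sections: List[Section], path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (heading_path, text) for every section, parents before children."""
    for section in sections:
        current = path + (section["title"],)
        yield current, section.get("text", "")
        # Recurse into nested subsections, carrying the heading path along.
        yield from walk_sections(section.get("subsections", []), current)


sections = [
    {
        "level": 1,
        "title": "History",
        "text": "Python was conceived in the late 1980s...",
        "subsections": [
            {
                "level": 2,
                "title": "Early development",
                "text": "Python 0.9.0 was published...",
                "subsections": [],
            }
        ],
    },
    {
        "level": 1,
        "title": "Design philosophy",
        "text": "Python is a multi-paradigm programming language...",
        "subsections": [],
    },
]

headings = [" > ".join(path) for path, _text in walk_sections(sections)]
print(headings)
# ['History', 'History > Early development', 'Design philosophy']
```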
Rate Limiting & Politeness
This actor follows the Wikimedia User-Agent policy:
- Uses a descriptive `User-Agent` header identifying itself as `Wikipedia Scraper / Apify Actor`
- Introduces a configurable delay (default 0.5 s) between every API call
- Respects Wikipedia's public API; no login or authentication required
- Does not scrape HTML; uses the official MediaWiki REST API exclusively
If you encounter rate-limiting errors, increase the Request Delay setting to 1.0–2.0 seconds.
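To illustrate what a fixed delay between API calls means in practice, here is a minimal sketch of a throttle that enforces a minimum gap between consecutive calls. It is not the actor's actual implementation, which this README does not show. The `Throttle` class and its injectable `clock` and `sleep` parameters are assumptions made for illustration and testability.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(
        self,
        delay: float = 0.5,  # seconds, matching the documented default
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep until `delay` seconds have passed since the previous call.

        Returns the number of seconds actually slept.
        """
        slept = 0.0
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


throttle = Throttle(delay=0.01)  # tiny delay so the demo finishes quickly
for title in ["Deep learning", "Neural network"]:
    throttle.wait()  # an API request for `title` would follow here
    print("would fetch:", title)
```

Raising `delay` to 1.0 or 2.0 corresponds to the adjustment recommended above when rate-limiting errors appear.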
Performance & Memory
| Input size | Recommended memory |
|---|---|
| 1–20 articles | 256 MB |
| 20–100 articles | 512 MB |
| Category crawl (100+ pages) | 1024 MB |
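The table can be turned into a small lookup when runs are started programmatically. The sketch below is one possible reading of it. The function name `recommended_memory_mb` is hypothetical, and because the table's ranges overlap at 20 and 100, assigning those boundary values to the lower tier is an arbitrary assumption.

```python
def recommended_memory_mb(page_count: int) -> int:
    """Map an expected page count to the memory tier from the table above."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count <= 20:
        return 256
    if page_count <= 100:
        return 512
    # Larger jobs, typically category crawls.
    return 1024


for count in (5, 60, 250):
    print(count, "pages ->", recommended_memory_mb(count), "MB")
```

With the Python client, the result can be supplied when starting a run through the `memory_mbytes` argument of `call()`; check the client reference for your installed version.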
Limitations
- Wikipedia's API caps some response sizes (links, categories). This actor returns up to 50 links and 50 categories per page.
- Some Wikipedia editions have incomplete `langlinks` metadata.
- Full-text extraction (`includeFullText: true`) significantly increases dataset size. Enable only when needed.
- Wikipedia may throttle aggressive requests. Keep `requestDelay` ≥ 0.5 seconds.
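Because links and categories are capped at 50 per page, a count that reaches the cap suggests the list may be incomplete. The following sketch flags such items using the `numLinks` and `numCategories` fields. The helper name `capped_fields` and the sample record are illustrative, and a count of exactly 50 can also be a coincidence rather than a truncation.

```python
from typing import Any, Dict, List

LINK_CAP = 50       # per-page link limit stated above
CATEGORY_CAP = 50   # per-page category limit stated above


def capped_fields(item: Dict[str, Any]) -> List[str]:
    """Return the names of fields whose count reached the documented cap."""
    capped = []
    if item.get("numLinks", 0) >= LINK_CAP:
        capped.append("links")
    if item.get("numCategories", 0) >= CATEGORY_CAP:
        capped.append("categories")
    return capped


# Hand-written sample record (hypothetical, not real actor output).
sample_item = {"title": "Physics", "numLinks": 50, "numCategories": 12}
flagged = capped_fields(sample_item)
print(sample_item["title"], "possibly truncated:", flagged)
```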
Legal & Attribution
This actor accesses only publicly available Wikipedia content through the official MediaWiki API, in compliance with Wikipedia's Terms of Use and Creative Commons Attribution-ShareAlike 4.0 License.
All extracted content remains subject to Wikipedia's licensing. When republishing Wikipedia content, you must attribute Wikipedia and link to the original article.
