
URL: https://apify.com/moving_beacon-owner1/my-actor-39

Wikipedia Data Scraper Pro · Apify


Pricing

$10.00/month + usage


Wikipedia Data Scraper Pro

An automated crawler that extracts textual content and metadata from Wikipedia pages for building knowledge bases.


Rating: 0.0 (0)

Developer: Jamshaid Arif (Maintained by Community)

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 3 months ago

Wikipedia Scraper

Extract structured data from Wikipedia at any scale (articles, sections, links, categories, and multilingual translations) without managing infrastructure.


What Does This Actor Do?

Wikipedia Scraper fetches public Wikipedia data through the official MediaWiki API. It supports three modes:

Mode                      Use Case
Article Pages             Scrape one or many articles by title
Category Crawl            Collect every article under a category (and its subcategories)
Translation Comparison    Fetch the same article across multiple language editions

Every result is pushed to the Apify Dataset as a structured record you can download as JSON, CSV, Excel, or XML.
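
As a rough illustration of the export step, the sketch below builds a download URL for a run's dataset using the Apify API's dataset items endpoint and its format query parameter. The helper name dataset_export_url and the EXPORT_FORMATS mapping are hypothetical, and the dataset ID is a placeholder; a private dataset would also need an API token.

```python
from urllib.parse import urlencode

# Formats named in the text, mapped to the `format` values of the Apify
# dataset-items endpoint (Excel is exposed there as "xlsx").
EXPORT_FORMATS = {"json": "json", "csv": "csv", "excel": "xlsx", "xml": "xml"}


def dataset_export_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the download URL for a dataset in one of the listed formats."""
    key = fmt.lower()
    if key not in EXPORT_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    query = urlencode({"format": EXPORT_FORMATS[key]})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"


# "YOUR_DATASET_ID" is a placeholder, e.g. run["defaultDatasetId"] from a finished run.
print(dataset_export_url("YOUR_DATASET_ID", "excel"))
```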


Output Fields

Each dataset item contains:

Field              Type        Description
title              string      Wikipedia article title
language           string      Language code (en, de, fr, …)
pageId             integer     Wikipedia internal page ID
url                string      Full URL to the article
scrapedAt          ISO date    Timestamp of extraction
summary            string      Full lead section text
summaryPreview     string      First 200 characters of summary
sections           array       Nested section tree (title + text + subsections)
links              object      Outbound wiki links { title → url }
categories         object      Article categories { name → url }
translations       object      Other language editions { lang → { title, url } }
numSections        integer     Count of top-level sections
numLinks           integer     Count of outbound links returned
numCategories      integer     Count of categories returned
numTranslations    integer     Count of available language editions
status             string      ok, not_found, network_error, or error

When Translation Comparison mode is used, items also include a comparisonBaseTitle field identifying the base English article.
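
A minimal sketch of how downloaded items might be post-processed, assuming only the field names listed above (status, language, comparisonBaseTitle). The helper names split_by_status and group_translations are hypothetical, and the sample records are hand-written, not real actor output.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Item = Dict[str, Any]


def split_by_status(items: Iterable[Item]) -> Tuple[List[Item], List[Item]]:
    """Separate usable records (status == "ok") from all other records."""
    ok: List[Item] = []
    failed: List[Item] = []
    for item in items:
        (ok if item.get("status") == "ok" else failed).append(item)
    return ok, failed


def group_translations(items: Iterable[Item]) -> Dict[str, Dict[str, Item]]:
    """Index Translation Comparison items as {comparisonBaseTitle: {language: item}}."""
    grouped: Dict[str, Dict[str, Item]] = defaultdict(dict)
    for item in items:
        base = item.get("comparisonBaseTitle")
        if base is not None:
            grouped[base][item["language"]] = item
    return dict(grouped)


# Hand-written sample records for illustration only.
sample = [
    {"title": "Artificial intelligence", "language": "en", "status": "ok",
     "comparisonBaseTitle": "Artificial intelligence"},
    {"title": "Künstliche Intelligenz", "language": "de", "status": "ok",
     "comparisonBaseTitle": "Artificial intelligence"},
    {"title": "No such page", "language": "en", "status": "not_found"},
]
ok, failed = split_by_status(sample)
print(len(ok), "ok,", len(failed), "failed")
print(sorted(group_translations(ok)["Artificial intelligence"]))
```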


Input Configuration

Mode: Article Pages

{
  "mode": "page",
  "topics": ["Python (programming language)", "Alan Turing", "Machine learning"],
  "language": "en",
  "includeSections": true,
  "includeLinks": true,
  "includeCategories": true,
  "includeTranslations": false,
  "includeFullText": false
}

Mode: Category Crawl

{
  "mode": "category",
  "categoryTitle": "Category:Machine learning",
  "language": "en",
  "categoryMaxDepth": 1,
  "maxPages": 50,
  "includeSections": true,
  "includeLinks": false
}

Mode: Translation Comparison

{
  "mode": "translations",
  "comparisonTitle": "Artificial intelligence",
  "translationLanguages": ["en", "de", "fr", "ja", "ar", "es", "zh", "ru"]
}
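
The three inputs above differ mainly in which key names the target. The sketch below checks that key before a run is started. The REQUIRED_KEY mapping is an assumption inferred from the examples, not a documented schema, and build_input is a hypothetical helper.

```python
from typing import Any, Dict

# Assumption: the key each mode needs, inferred from the three examples above.
REQUIRED_KEY = {
    "page": "topics",
    "category": "categoryTitle",
    "translations": "comparisonTitle",
}


def build_input(mode: str, **options: Any) -> Dict[str, Any]:
    """Return a run-input dict, rejecting unknown modes and missing keys."""
    if mode not in REQUIRED_KEY:
        raise ValueError(f"unknown mode: {mode!r}")
    required = REQUIRED_KEY[mode]
    if not options.get(required):
        raise ValueError(f"mode {mode!r} needs a non-empty {required!r}")
    return {"mode": mode, **options}


run_input = build_input(
    "category",
    categoryTitle="Category:Machine learning",
    categoryMaxDepth=1,
    maxPages=50,
)
print(run_input)
```

The resulting dict can be passed as run_input in the Python example below.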

Usage Examples

Using the Apify API (Python)

from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish
run = client.actor("YOUR_ACTOR_ID").call(run_input={
    "mode": "page",
    "topics": ["Deep learning", "Neural network", "Transformer (machine learning model)"],
    "language": "en",
    "includeSections": True,
    "includeLinks": True,
})

# Read the results from the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], "→", item["url"])
    print("  Summary:", item["summaryPreview"])
    print("  Sections:", item["numSections"])

Using the Apify API (JavaScript/Node.js)

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Top-level await is unavailable in CommonJS, so wrap the calls in an async function
(async () => {
    const run = await client.actor('YOUR_ACTOR_ID').call({
        mode: 'category',
        categoryTitle: 'Category:Physics',
        categoryMaxDepth: 1,
        maxPages: 30,
    });
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => console.log(item.title, item.url));
})();

Using the Apify CLI

# Install CLI
npm install -g apify-cli
# Run locally (requires .actor/ directory)
apify run --input='{"mode":"page","topics":["Quantum computing"]}'
# Deploy to Apify platform
apify push

Sections Structure Example

When includeSections is true, each article item contains a sections array:

{
  "sections": [
    {
      "level": 1,
      "title": "History",
      "text": "Python was conceived in the late 1980s by Guido van Rossum...",
      "subsections": [
        {
          "level": 2,
          "title": "Early development",
          "text": "Python 0.9.0 was published to alt.sources in February 1991...",
          "subsections": []
        }
      ]
    },
    {
      "level": 1,
      "title": "Design philosophy",
      "text": "Python is a multi-paradigm programming language...",
      "subsections": []
    }
  ]
}
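
Because the sections array is nested, consumers usually need a recursive walk to reach every section. The sketch below flattens the tree into (title path, text) pairs, assuming only the title, text, and subsections keys shown above. The function name walk_sections is hypothetical.

```python
from typing import Any, Dict, Iterator, List, Tuple

Section = Dict[str, Any]
Path = Tuple[str, ...]


def walk_sections(sections: List[Section], path: Path = ()) -> Iterator[Tuple[Path, str]]:
    """Yield (title path, text) for every section, parents before children."""
    for section in sections:
        here = path + (section["title"],)
        yield here, section.get("text", "")
        yield from walk_sections(section.get("subsections", []), here)


# Abbreviated copy of the example structure above.
sections = [
    {"level": 1, "title": "History", "text": "Python was conceived...",
     "subsections": [
         {"level": 2, "title": "Early development",
          "text": "Python 0.9.0 was published...", "subsections": []},
     ]},
    {"level": 1, "title": "Design philosophy",
     "text": "Python is a multi-paradigm...", "subsections": []},
]
flat = list(walk_sections(sections))
for title_path, text in flat:
    print(" > ".join(title_path), "|", text)
```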

Rate Limiting & Politeness

This actor follows the Wikimedia User-Agent policy:

  • Uses a descriptive User-Agent header identifying itself as Wikipedia Scraper / Apify Actor
  • Introduces a configurable delay (default 0.5 s) between every API call
  • Uses Wikipedia's public API; no login or authentication is required
  • Does not scrape HTML; uses the official MediaWiki REST API exclusively

If you encounter rate-limiting errors, increase the Request Delay setting to 1.0–2.0 seconds.
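
To make the effect of the delay concrete, here is a minimal sketch of a fixed pause between consecutive calls. It is an illustration of the idea, not the actor's actual code; the function name paced is hypothetical, and stand-in callables replace real HTTP requests so the example runs offline.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(calls: Iterable[Callable[[], T]], delay: float = 0.5,
          sleep: Callable[[float], None] = time.sleep) -> Iterator[T]:
    """Run each call in turn, sleeping `delay` seconds between consecutive calls."""
    for index, call in enumerate(calls):
        if index > 0:
            sleep(delay)  # no pause is needed before the very first call
        yield call()


# Stand-in callables; recording the sleeps instead of waiting keeps the demo instant.
fake_calls = [lambda n=n: f"response {n}" for n in range(3)]
waits = []
results = list(paced(fake_calls, delay=0.5, sleep=waits.append))
print(results)
print("total pause:", sum(waits), "seconds")
```

With n calls the pauses add up to (n - 1) × delay, so raising the delay from 0.5 s to 2.0 s roughly quadruples the waiting time of a large crawl.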


Performance & Memory

Input size                     Recommended memory
1–20 articles                  256 MB
20–100 articles                512 MB
Category crawl (100+ pages)    1024 MB
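
The table can be encoded as a small lookup when runs are scheduled programmatically. This is a sketch under stated assumptions: exactly 20 articles is treated as the lower tier, any input above 100 articles gets 1024 MB, and a category crawl of fewer than 100 pages is sized like an article run. None of these boundary choices is specified by the table, and recommended_memory_mb is a hypothetical helper.

```python
def recommended_memory_mb(num_articles: int, category_crawl: bool = False) -> int:
    """Map an input size to a memory tier from the table above."""
    if num_articles < 1:
        raise ValueError("num_articles must be at least 1")
    if category_crawl and num_articles >= 100:
        return 1024
    if num_articles <= 20:  # assumption: the shared boundary goes to the lower tier
        return 256
    if num_articles <= 100:
        return 512
    return 1024  # assumption: the table lists no tier for larger article runs


for size in (5, 50, 150):
    print(size, "articles:", recommended_memory_mb(size), "MB")
```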

Limitations

  • Wikipedia's API caps some response sizes (links, categories). This actor returns up to 50 links and 50 categories per page.
  • Some Wikipedia editions have incomplete langlinks metadata.
  • Full-text extraction (includeFullText: true) significantly increases dataset size. Enable only when needed.
  • Wikipedia may throttle aggressive requests. Keep requestDelay ≥ 0.5 seconds.
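
Given the 50-item cap, a count that sits exactly at the cap may indicate that more links or categories exist than were returned. The sketch below flags such items using the numLinks and numCategories fields; possibly_truncated is a hypothetical helper, and a page with exactly 50 real links would be flagged too.

```python
from typing import Any, Dict, List

CAP = 50  # per-page cap on links and categories stated above


def possibly_truncated(item: Dict[str, Any]) -> List[str]:
    """Return the names of the fields whose count has reached the cap."""
    checks = (("links", "numLinks"), ("categories", "numCategories"))
    return [name for name, count_key in checks if item.get(count_key, 0) >= CAP]


# Hand-written sample records for illustration only.
print(possibly_truncated({"title": "Physics", "numLinks": 50, "numCategories": 12}))
print(possibly_truncated({"title": "Stub article", "numLinks": 3, "numCategories": 2}))
```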

Legal & Attribution

This actor accesses only publicly available Wikipedia content through the official MediaWiki API, in compliance with Wikipedia's Terms of Use and the Creative Commons Attribution-ShareAlike 4.0 License.

All extracted content remains subject to Wikipedia's licensing. When republishing Wikipedia content, you must attribute Wikipedia and link to the original article.
