Smart Article & Blog Extractor

Extract clean text, author, title, and reading time from any news, blog, or article webpage. Perfect for AI/LLM training and RAG systems.

Pricing

from $0.50 / 1,000 results

Developer

Lightkong

Maintained by Community

Last modified: 2 months ago

🧠 Smart Article & Blog Extractor

The ultimate tool for LLMs, RAG pipelines, and Content Analyzers. Extract clean, ad-free text from any news site, blog, or article in seconds.

Why this Actor?

When you train AI models or build RAG (Retrieval-Augmented Generation) systems, you don't want menus, sidebars, cookie popups, or footer links ruining your dataset. You only want the Title, Author, and the actual Content.

This actor uses Mozilla's powerful Readability algorithm (the same engine that powers Firefox's Reader View) to automatically strip away all the junk and give you a beautifully clean text output.

Advantages:

Universal: Works on Medium, TechCrunch, WordPress blogs, Substack, CNN, NYTimes, and most other article pages.
Ultra-Fast: Uses plain HTTP requests (CheerioCrawler) instead of a browser, typically extracting an article in less than a second per page.
Cost-Effective: Because it doesn't open a heavy browser, your Apify Compute Unit (CU) usage stays very low.

💰 Pricing: Pay-Per-Result

We charge only $0.50 per 1,000 articles extracted.
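
To budget a crawl, the per-result price above can be turned into a quick estimate. This is a minimal sketch: the function name `estimate_cost_usd` is hypothetical, and it covers only the stated $0.50 per 1,000 results. Whether any other charges apply is not specified here.

```python
from decimal import Decimal

# USD per 1,000 extracted articles, taken from the pricing stated above.
PRICE_PER_1000_RESULTS = Decimal("0.50")


def estimate_cost_usd(num_articles: int) -> Decimal:
    """Return the estimated per-result charge in USD (hypothetical helper)."""
    if num_articles < 0:
        raise ValueError("num_articles must be non-negative")
    return PRICE_PER_1000_RESULTS * num_articles / 1000


if __name__ == "__main__":
    for n in (1_000, 25_000, 1_000_000):
        print(f"{n:>9,} articles -> ${estimate_cost_usd(n):.2f}")
```

`Decimal` is used instead of `float` so that small per-article amounts do not pick up binary rounding noise.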

📥 Input Schema

Field	Type	Description
`startUrls`	Array	A list of article or blog URLs you want to extract.
`proxyConfiguration`	Object	Standard Apify proxy settings to bypass IP blocks.
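
As an illustration, the input could be assembled like this before starting a run. Treat it as a sketch under assumptions: the table only says `startUrls` is an array, so the `{"url": ...}` shape of each entry is a guess based on the common Apify convention, `useApifyProxy` is assumed to be the proxy option you want, and `build_actor_input` is a hypothetical helper. Check the Actor's input tab for the exact shape.

```python
import json


def build_actor_input(urls, use_apify_proxy=True):
    """Build an input dict for the Actor (hypothetical helper).

    ASSUMPTION: each startUrls entry is an object with a "url" key.
    The source only documents startUrls as an array of URLs.
    """
    cleaned = []
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # skip blank lines, e.g. from a pasted URL list
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an http(s) URL: {url!r}")
        cleaned.append(url)
    return {
        "startUrls": [{"url": u} for u in cleaned],
        # ASSUMPTION: "useApifyProxy" is the proxy setting being toggled.
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


if __name__ == "__main__":
    actor_input = build_actor_input(
        ["https://techcrunch.com/2023/12/20/example-article/"]
    )
    print(json.dumps(actor_input, indent=2))
```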

📤 Output Schema

For each URL, the actor will produce a clean JSON object.

{
  "url": "https://techcrunch.com/2023/12/20/example-article/",
  "title": "The Future of Artificial Intelligence",
  "author": "Jane Doe",
  "publishedTime": "2023-12-20T10:00:00Z",
  "siteName": "TechCrunch",
  "textContent": "Artificial intelligence has been evolving rapidly... (clean text continues)",
  "readingTimeMins": 4,
  "scrapedAt": "2026-04-30T17:30:00.000Z"
}
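
For a RAG pipeline, a record like the one above usually needs to be split into smaller passages before embedding. The following is one possible sketch, not part of the Actor itself: `Chunk`, `chunk_article`, and the `max_words` and `overlap` defaults are hypothetical choices, and only the `url`, `title`, and `textContent` field names come from the output sample.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    url: str
    title: str
    index: int
    text: str


def chunk_article(record: dict, max_words: int = 200, overlap: int = 20) -> list[Chunk]:
    """Split a record's textContent into overlapping word-based chunks."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = (record.get("textContent") or "").split()
    step = max_words - overlap  # each chunk starts this many words after the last
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        chunks.append(
            Chunk(
                url=record["url"],
                title=record.get("title") or "",
                index=index,
                text=" ".join(words[start:start + max_words]),
            )
        )
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the article
    return chunks


if __name__ == "__main__":
    record = {
        "url": "https://techcrunch.com/2023/12/20/example-article/",
        "title": "The Future of Artificial Intelligence",
        "textContent": "Artificial intelligence has been evolving rapidly " * 100,
    }
    for chunk in chunk_article(record):
        print(chunk.index, len(chunk.text.split()), "words")
```

Word-based splitting keeps the example dependency-free. A token-based splitter matched to your embedding model may fit better in practice.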

Start extracting clean knowledge today!
