Smart Article Extractor

Pricing

from $40.00 / 1,000 results

Extract clean article content from any news, blog, or publisher site! Pull full body text, author, publish date, word count, language, reading time, images, and metadata at scale. Ideal for content research, media monitoring, SEO audits, and AI training. Start extracting articles in minutes!

Rating

0.0

(0)

Developer

ParseForge

Maintained by Community

Last modified

22 days ago

📰 Smart Article Extractor

🚀 Parse any news article or blog post into clean structured text in seconds. Get 23 metadata fields per article including authors, tags, publish date, lead image, paywall flag, and reading time. No API key, no registration, no manual parser maintenance.

🕒 Last updated: 2026-04-21 · 📊 23 fields per article · 🌐 Works on any site · ⚡ 10 articles in ~10 seconds · 💰 Paywall detection

Pull structured records from Smart Article Extractor: clean fields ready as CSV, JSON, JSONL, Excel, or XML for downstream pipelines.

Copy to your AI assistant

Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this actor.

parseforge/article-extractor on Apify. Call: ApifyClient("TOKEN").actor("parseforge/article-extractor").call(run_input={...}), then client.dataset(run["defaultDatasetId"]).list_items().items for results. Key inputs: startUrls (array, default [{"url":"https://www.bbc.com/news/articles/c86w8elez74o"}]), maxItems (integer, default 10). Full actor spec: fetch build via GET https://api.apify.com/v2/acts/parseforge~article-extractor (Bearer TOKEN). Get token: https://console.apify.com/account/integrations

The Smart Article Extractor takes any article URL and returns the main body as clean Markdown alongside 22 metadata fields. It scores DOM nodes by paragraph count, word count, and link density to identify the main content block, then strips navigation, sidebars, and ads. Author, tags, section, publishedAt, modifiedAt, and canonical URL are pulled from meta tags, JSON-LD, and itemprop attributes.
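The scoring idea described above can be illustrated with a minimal sketch. This is not the Actor's actual implementation: the `CONTAINERS` set, the `BlockScorer` class, the weight of 25 per paragraph, and the rule that text is credited only to the innermost container are all assumptions made for illustration, using only Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Tags treated as candidate content blocks (an illustrative choice).
CONTAINERS = {"article", "main", "section", "div", "nav", "aside", "header", "footer"}


class BlockScorer(HTMLParser):
    """Tallies paragraphs, words, and linked words for each container element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open containers, innermost last
        self.blocks = []   # every container seen, in document order
        self.in_link = 0   # nesting depth of <a> elements
        self.in_skip = 0   # nesting depth of <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            block = {"tag": tag, "paragraphs": 0, "words": 0, "link_words": 0}
            self.stack.append(block)
            self.blocks.append(block)
        elif tag == "p" and self.stack:
            self.stack[-1]["paragraphs"] += 1
        elif tag == "a":
            self.in_link += 1
        elif tag in ("script", "style"):
            self.in_skip += 1

    def handle_endtag(self, tag):
        if tag in CONTAINERS and self.stack:
            self.stack.pop()
        elif tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in ("script", "style") and self.in_skip:
            self.in_skip -= 1

    def handle_data(self, data):
        if self.in_skip or not self.stack:
            return
        count = len(data.split())
        # Text is credited to the innermost open container only.
        self.stack[-1]["words"] += count
        if self.in_link:
            self.stack[-1]["link_words"] += count


def score(block):
    """Higher for long, paragraph-rich blocks; lower when most words are links."""
    if block["words"] == 0:
        return 0.0
    link_density = block["link_words"] / block["words"]
    return (block["words"] + 25 * block["paragraphs"]) * (1.0 - link_density)


def best_block(html):
    """Returns the stats of the highest-scoring container, or None if none exist."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    return max(parser.blocks, key=score, default=None)
```

Under this scoring, a navigation bar made entirely of links has a link density of 1 and scores zero, while a block with several plain-text paragraphs scores high, which is the behavior the paragraph above describes.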

Extras include a paywall-detection heuristic, inline image collection, lead image (Open Graph), language detection, word count, and reading time. The Actor fetches up to 10 articles in parallel, so a list of 100 news URLs finishes in about 15 seconds. It works out of the box on most major news sites, blogs, and publishing platforms.
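As a rough sketch of how the word count, reading time, and paywall flag could be derived from extracted text: the reading speed of 200 words per minute and the marker phrases in `PAYWALL_MARKERS` are assumptions chosen for illustration, not values documented for this Actor.

```python
import math

WORDS_PER_MINUTE = 200  # assumed average reading speed

# Hypothetical marker phrases; a real heuristic would likely use a longer list.
PAYWALL_MARKERS = (
    "subscribe to continue",
    "subscribers only",
    "already a subscriber",
)


def word_count(text):
    """Counts whitespace-separated tokens."""
    return len(text.split())


def reading_time_minutes(text):
    """Rounds up to whole minutes, with a minimum of one minute."""
    return max(1, math.ceil(word_count(text) / WORDS_PER_MINUTE))


def looks_paywalled(text):
    """True when any marker phrase appears in the text, ignoring case."""
    lowered = text.lower()
    return any(marker in lowered for marker in PAYWALL_MARKERS)
```

A heuristic like this produces false positives and negatives, so the flag is best treated as a filter hint rather than a guarantee.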

| 🎯 Target Audience | 💡 Primary Use Cases |
| --- | --- |
| News aggregators, media monitoring teams, AI app developers, content researchers, data journalists, archivists | News datasets, summarization pipelines, media monitoring, sentiment analysis, archive assembly |

📋 What the Smart Article Extractor does

Five extraction workflows in a single run:

📝 Main body extraction. DOM scoring isolates the article content and strips navigation, ads, and sidebars.
👥 Author detection. Pulls authors from meta tags, JSON-LD, and itemprop attributes.
📅 Date stamps. Captures both article:published_time and article:modified_time.
🏷️ Tags and section. Extracts article:tag and article:section metadata.
💰 Paywall flag. Heuristic detects common paywall markers so you can filter downstream.

Every record also includes the canonical URL, lead image, inline images, word count, reading time, language, site name, HTTP status, and timestamp.
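The meta-tag part of the workflows above can be sketched with the standard library. This covers only `<meta>` tags (not JSON-LD or itemprop), and the `MetaCollector` class and `extract_article_meta` function are hypothetical names introduced here, not part of the Actor.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta property=...> and <meta name=...> content values."""

    def __init__(self):
        super().__init__()
        self.meta = {}  # lowercased property/name -> list of content values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content:
            self.meta.setdefault(key.lower(), []).append(content.strip())


def extract_article_meta(html):
    """Maps common article meta tags onto the field names used in the output."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()

    def first(key):
        values = collector.meta.get(key)
        return values[0] if values else None

    return {
        "publishedAt": first("article:published_time"),
        "modifiedAt": first("article:modified_time"),
        "section": first("article:section"),
        "tags": collector.meta.get("article:tag", []),  # may repeat per tag
        "leadImage": first("og:image"),
    }
```

Missing tags come back as `None` (or an empty list for `tags`), matching the nullable types listed in the schema.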

💡 Why it matters: news sites each have their own HTML structure. Writing per-site parsers is brittle and breaks every time a publisher redesigns their pages. This Actor uses readability-style scoring that works across any article-shaped page.

🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing extraction across news sites, blogs, and platforms.

⚙️ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| startUrls | array of URLs | required | One or more article URLs to extract. |
| maxItems | integer | 10 | Maximum number of articles returned. The free plan caps at 10, paid plans at 1,000,000. |

Example: extract a single article.

{
  "startUrls": [
    {"url": "https://techcrunch.com/2025/01/10/openai-launches-gpt-store/"}
  ],
  "maxItems": 1
}

Example: batch extraction for media monitoring.

{
  "startUrls": [
    {"url": "https://www.theverge.com/2025/ai-coverage-1"},
    {"url": "https://www.wired.com/story/ai-agents-2026"},
    {"url": "https://arstechnica.com/ai/article"}
  ],
  "maxItems": 100
}
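When the URL list comes from a feed or spreadsheet, it can help to build the input object in code. The helper below is a hypothetical sketch (the name `build_run_input` and its validation rules are assumptions); it only produces the `startUrls` and `maxItems` shape shown in the examples above.

```python
from urllib.parse import urlsplit


def build_run_input(urls, max_items=10):
    """Builds the Actor input object from plain URL strings."""
    seen = set()
    start_urls = []
    for raw in urls:
        url = raw.strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {raw!r}")
        if url not in seen:  # drop exact duplicates, keep first-seen order
            seen.add(url)
            start_urls.append({"url": url})
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    if max_items < 1:
        raise ValueError("maxItems must be a positive integer")
    return {"startUrls": start_urls, "maxItems": max_items}
```

Passing the result through `json.dumps` yields text that can be pasted into the Actor's JSON input editor.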

⚠️ Good to Know: works best on article-shaped pages (one headline, one author, one body). Homepages, category pages, and list views return thin extractions because there is no single article to score.

📊 Output

Each record contains 23 fields, plus an error field that is set on failure. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🔗 url | string | `"https://techcrunch.com/.../gpt-store/"` |
| 🔁 canonicalUrl | string \| null | `"https://techcrunch.com/.../gpt-store/"` |
| 🏷️ title | string \| null | `"OpenAI launches GPT Store"` |
| 📑 subtitle | string \| null | `"Available to Plus, Team, Enterprise"` |
| 🧑 author | string \| null | `"Kyle Wiggers"` |
| 👥 authors | string[] | `["Kyle Wiggers"]` |
| 📅 publishedAt | ISO 8601 \| null | `"2025-01-10T14:00:00Z"` |
| 🔁 modifiedAt | ISO 8601 \| null | `"2025-01-10T16:30:00Z"` |
| 🏢 siteName | string \| null | `"TechCrunch"` |
| 🗂️ section | string \| null | `"AI"` |
| 🏷️ tags | string[] | `["openai", "gpt-store"]` |
| 🌍 language | string \| null | `"en-US"` |
| 📝 description | string \| null | `"OpenAI rolled out the long-teased GPT Store..."` |
| 🖼️ leadImage | string \| null | `"https://.../og.jpg"` |
| 🎨 images | string[] | `["https://...", "https://..."]` |
| 📃 markdown | string | `"# OpenAI launches GPT Store..."` |
| 💬 text | string | plain text without markdown markers |
| 🧾 html | string | cleaned article HTML |
| 🔢 wordCount | number | `742` |
| ⏱️ readingTimeMinutes | number | `4` |
| 💰 hasPaywall | boolean | `false` |
| 🟢 httpStatus | number | `200` |
| 🕒 scrapedAt | ISO 8601 | `"2026-04-21T12:00:00.000Z"` |
| ❗ error | string \| null | `"Timeout"` on failure |
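A short sketch of how downstream code might use the `error`, `httpStatus`, and `hasPaywall` fields from the schema to keep only usable records and write them as JSON Lines. The function names and the default field selection are illustrative assumptions.

```python
import json


def usable_records(records):
    """Keeps records that fetched successfully and are not flagged as paywalled."""
    return [
        r for r in records
        if not r.get("error")
        and r.get("httpStatus") == 200
        and not r.get("hasPaywall")
    ]


def to_jsonl(records, fields=("url", "title", "publishedAt", "wordCount", "text")):
    """Serializes the selected fields as JSON Lines, one record per line."""
    return "\n".join(
        json.dumps({f: r.get(f) for f in fields}, ensure_ascii=False)
        for r in records
    )
```

Fields missing from a record are written as `null`, which keeps every line's shape identical for later loading.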

✨ Why choose this Actor

| | Capability |
| --- | --- |
| 🧠 | DOM scoring. Readability-style extraction works across any article-shaped page without per-site rules. |
| 📊 | 23 fields. Authors, tags, section, dates, images, paywall, reading time, and canonical URL. |
| 💰 | Paywall detection. Flags articles likely behind a paywall so you can filter them out. |
| ⚡ | Fast. 10 articles in under 10 seconds with parallel fetching. |
| 🖼️ | Image capture. Lead image plus every inline image URL in the article body. |
| 🚫 | No credentials. Runs on any public article URL. |
| 🔌 | Integrations. Plugs into RSS feeds, newsroom tools, and news datasets. |

📊 Clean article text is the foundation of news summarization, sentiment analysis, and media monitoring. This Actor delivers it consistently without per-site parsers.

📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ Smart Article Extractor (this Actor) | $5 free credit, then pay-per-use | Any public article URL | Live per run | 23 metadata fields | ⚡ 2 min |
| Open-source readability libs | Free | Whatever you host | Your code | Whatever you build | 🐢 Days |
| News API services | $99+/month | Curated feeds | Real-time | Per-plan limits | ⏳ Hours |
| Paid media monitoring | $$$+/month | Managed sources | Real-time | Rich UI | 🕒 Variable |

Pick this Actor when you want article text from arbitrary URLs without maintaining your own extraction library.

🚀 How to use

📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
🌐 Open the Actor. Go to the Smart Article Extractor page on the Apify Store.
🎯 Paste URLs. Add article URLs to the startUrls field and set maxItems.
🚀 Run it. Click Start and let the Actor extract the content.
📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.

💼 Business use cases

📰 News Aggregation

Build custom news feeds across sources
Deduplicate stories across outlets
Normalize article structure for downstream apps
Feed summarization pipelines

🧠 AI & Summarization

Extract clean text for LLM summaries
Build news datasets for fine-tuning
Ground chatbots with current media
Power question-answering over news

📡 Media Monitoring

Track brand mentions across outlets
Monitor coverage of products or events
Capture executive quotes and bylines
Detect paywalled coverage to license

📚 Research & Archives

Build academic text corpora
Archive public journalism
Extract metadata for bibliographies
Preserve retracted or deleted articles

🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

Empirical datasets for papers, thesis work, and coursework
Longitudinal studies tracking changes across snapshots
Reproducible research with cited, versioned data pulls
Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

Side projects, portfolio demos, and indie app launches
Data visualizations, dashboards, and infographics
Content research for bloggers, YouTubers, and podcasters
Hobbyist collections and personal trackers

🤝 Non-profit and civic

Transparency reporting and accountability projects
Advocacy campaigns backed by public-interest data
Community-run databases for local issues
Investigative journalism on public records

🧪 Experimentation

Prototype AI and machine-learning pipelines with real data
Validate product-market hypotheses before engineering spend
Train small domain-specific models on niche corpora
Test dashboard concepts with live input

🔌 Automating Smart Article Extractor

Control the scraper programmatically for scheduled runs and pipeline integrations:

🟢 Node.js. Install the apify-client NPM package.
🐍 Python. Use the apify-client PyPI package.
📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Pair it with an RSS reader or Google News feed for continuous media monitoring.
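For the Python route, a minimal sketch based on the call pattern quoted in the "Copy to your AI assistant" block. The wrapper name `run_extractor` is hypothetical, and the client is passed in rather than constructed so the function has no hard dependency: in real use it would be `ApifyClient("TOKEN")` from the `apify-client` package.

```python
ACTOR_ID = "parseforge/article-extractor"


def run_extractor(client, urls, max_items=10):
    """Runs the Actor on the given URLs and returns the dataset items.

    `client` is expected to behave like apify_client.ApifyClient, e.g.:
        from apify_client import ApifyClient
        client = ApifyClient("TOKEN")
    """
    run_input = {
        "startUrls": [{"url": url} for url in urls],
        "maxItems": max_items,
    }
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        # Defensive check: treat a missing run object as a failed run.
        raise RuntimeError("Actor run did not return a result")
    return client.dataset(run["defaultDatasetId"]).list_items().items
```

Each returned item is one article record with the fields listed in the schema, so the result can be filtered or stored directly.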

🔌 Integrate with any app

Smart Article Extractor connects to any cloud service via Apify integrations:

Make - Automate multi-step workflows
Zapier - Connect with 5,000+ apps
Slack - Post article summaries to channels
Airbyte - Pipe articles into your warehouse
GitHub - Trigger runs from commits
Google Drive - Export articles to Docs

You can also use webhooks to trigger summarization and alerting pipelines when new articles finish extracting.

🔗 Recommended Actors

🤖 RAG Web Browser - Search or fetch URLs with LLM-ready output
🕸️ Website Content Crawler - Deep-crawl a domain with depth control
🔍 Google Search Scraper - SERP results with rank and description
📈 Google Trends Scraper - Interest over time and related queries
📧 Contact Info Scraper - Emails, phones, and socials from URLs

💡 Pro Tip: browse the complete ParseForge collection for more content-extraction tools.

🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.

⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any publisher, news outlet, or readability library. Only publicly accessible article URLs are processed. Respect the copyright and terms of service of every publisher you extract from.

URL: https://apify.com/parseforge/article-extractor
