URL: https://apify.com/parseforge/article-extractor

Scrape and download articles and news · Apify


Pricing

from $40.00 / 1,000 results

Smart Article Extractor

Extract clean article content from any news, blog, or publisher site! Pull full body text, author, publish date, word count, language, reading time, images, and metadata at scale. Ideal for content research, media monitoring, SEO audits, and AI training. Start extracting articles in minutes!

Rating: 0.0 (0)

Developer: ParseForge · Maintained by Community

Actor stats: 0 bookmarked · 7 total users · 2 monthly active users · last modified 22 days ago

📰 Smart Article Extractor

🚀 Parse any news article or blog post into clean structured text in seconds. Get 23 metadata fields per article including authors, tags, publish date, lead image, paywall flag, and reading time. No API key, no registration, no manual parser maintenance.

🕒 Last updated: 2026-04-21 · 📊 23 fields per article · 🌐 Works on any site · ⚡ 10 articles in ~10 seconds · 💰 Paywall detection

Pull structured records from Smart Article Extractor: clean fields ready as CSV, JSON, JSONL, Excel, or XML for downstream pipelines.

Copy to your AI assistant

Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this actor.

parseforge/article-extractor on Apify. Call: ApifyClient("TOKEN").actor("parseforge/article-extractor").call(run_input={...}), then client.dataset(run["defaultDatasetId"]).list_items().items for results. Key inputs: startUrls (array, default [{"url":"https://www.bbc.com/news/articles/c86w8elez74o"}]), maxItems (integer, default 10). Full actor spec: fetch build via GET https://api.apify.com/v2/acts/parseforge~article-extractor (Bearer TOKEN). Get token: https://console.apify.com/account/integrations
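The call sequence in the block above can be sketched in Python roughly as follows. This is a minimal sketch, assuming the `apify-client` package is installed and that `call()` returns a dict-like run object as the snippet above implies. The helper names `build_run_input` and `fetch_articles` are illustrative and not part of the Actor.

```python
from typing import Any

ACTOR_ID = "parseforge/article-extractor"


def build_run_input(urls: list[str], max_items: int = 10) -> dict[str, Any]:
    """Build the Actor input: startUrls is a list of {"url": ...} objects."""
    if not urls:
        raise ValueError("startUrls needs at least one article URL")
    return {"startUrls": [{"url": u} for u in urls], "maxItems": max_items}


def fetch_articles(client: Any, urls: list[str], max_items: int = 10) -> list[dict[str, Any]]:
    """Run the Actor and return the records from its default dataset.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(ACTOR_ID).call(run_input=build_run_input(urls, max_items))
    return client.dataset(run["defaultDatasetId"]).list_items().items


# Usage with a real token (requires `pip install apify-client`):
#   from apify_client import ApifyClient
#   records = fetch_articles(ApifyClient("TOKEN"),
#                            ["https://www.bbc.com/news/articles/c86w8elez74o"], max_items=1)
```

Passing the client in as an argument keeps the helper easy to test with a stub and avoids hard-coding the token.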

The Smart Article Extractor takes any article URL and returns the main body as clean Markdown alongside 22 metadata fields. It scores DOM nodes by paragraph count, word count, and link density to identify the main content block, then strips navigation, sidebars, and ads. Author, tags, section, publishedAt, modifiedAt, and canonical URL are pulled from meta tags, JSON-LD, and itemprop attributes.
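To make the scoring idea concrete, here is a minimal readability-style sketch. The Actor's actual formula and weights are not published, so the `Candidate` structure, the weight of 25 per paragraph, and the link-density discount are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Summary statistics for one candidate DOM node (hypothetical structure)."""
    name: str
    paragraphs: int   # number of <p> descendants
    words: int        # total words of text inside the node
    link_words: int   # words that sit inside <a> tags


def link_density(c: Candidate) -> float:
    """Fraction of the node's text that is link text (0.0 for an empty node)."""
    return c.link_words / c.words if c.words else 0.0


def score(c: Candidate) -> float:
    """Reward paragraphs and words, then discount by link density.

    The weights (25 per paragraph, 1 per word) are illustrative assumptions.
    """
    return (25 * c.paragraphs + c.words) * (1.0 - link_density(c))


def pick_main_block(candidates: list[Candidate]) -> Candidate:
    """Return the highest-scoring candidate."""
    return max(candidates, key=score)


candidates = [
    Candidate("nav", paragraphs=0, words=40, link_words=40),
    Candidate("sidebar", paragraphs=2, words=120, link_words=90),
    Candidate("article", paragraphs=12, words=740, link_words=30),
]
print(pick_main_block(candidates).name)  # article
```

Navigation scores zero because all of its text is link text, which is why link density is a useful signal for stripping menus and sidebars.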

Extras include a paywall-detection heuristic, inline image collection, lead image (Open Graph), language detection, word count, and reading time. The Actor fetches up to 10 articles in parallel, so a list of 100 news URLs finishes in about 15 seconds. It works out of the box on most major news sites, blogs, and publishing platforms.
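The parallel fetching described above can be approximated with a bounded thread pool. This is a sketch under stated assumptions, not the Actor's implementation: `fetch_all` and the injected `fetch` callable are hypothetical names, and only the limit of 10 concurrent requests comes from the text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_all(urls: list[str], fetch: Callable[[str], str], max_workers: int = 10) -> dict[str, str]:
    """Fetch every URL with at most `max_workers` requests in flight.

    `fetch` is injected so the sketch stays independent of any HTTP library.
    A failure is recorded as an "error: ..." string instead of aborting the batch.
    """
    def safe_fetch(url: str) -> str:
        try:
            return fetch(url)
        except Exception as exc:  # keep going when one article fails
            return f"error: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with urls.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Any function that takes a URL and returns the page body can be passed as `fetch`, for example a thin wrapper around `urllib.request.urlopen`.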

| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| News aggregators, media monitoring teams, AI app developers, content researchers, data journalists, archivists | News datasets, summarization pipelines, media monitoring, sentiment analysis, archive assembly |

📋 What the Smart Article Extractor does

Five extraction workflows in a single run:

  • 📝 Main body extraction. DOM scoring isolates the article content and strips navigation, ads, and sidebars.
  • 👥 Author detection. Pulls authors from meta tags, JSON-LD, and itemprop attributes.
  • 📅 Date stamps. Captures both article:published_time and article:modified_time.
  • 🏷️ Tags and section. Extracts article:tag and article:section metadata.
  • 💰 Paywall flag. A heuristic detects common paywall markers so you can filter downstream.

Every record also includes the canonical URL, lead image, inline images, word count, reading time, language, site name, HTTP status, and timestamp.
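The meta-tag part of these workflows can be sketched with the standard library alone. This is an illustrative sketch, not the Actor's parser: it reads only the `article:*` properties, `og:image`, and the canonical link, and it leaves out JSON-LD and itemprop handling. The class and function names are hypothetical, and the output keys are borrowed from the schema for readability.

```python
from html.parser import HTMLParser
from typing import Any


class ArticleMetaParser(HTMLParser):
    """Collect article:* meta properties and the canonical link from a page."""

    # Maps a meta "property" value to a single-valued output field.
    SINGLE = {
        "article:published_time": "publishedAt",
        "article:modified_time": "modifiedAt",
        "article:section": "section",
        "og:image": "leadImage",
    }

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, Any] = {name: None for name in self.SINGLE.values()}
        self.fields["canonicalUrl"] = None
        self.fields["tags"] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("content"):
            prop = attr.get("property")
            if prop in self.SINGLE and self.fields[self.SINGLE[prop]] is None:
                self.fields[self.SINGLE[prop]] = attr["content"]  # first occurrence wins
            elif prop == "article:tag":
                self.fields["tags"].append(attr["content"])       # tags may repeat
        elif tag == "link" and attr.get("rel") == "canonical" and attr.get("href"):
            self.fields["canonicalUrl"] = attr["href"]


def extract_meta(html: str) -> dict[str, Any]:
    """Parse an HTML string and return the collected metadata fields."""
    parser = ArticleMetaParser()
    parser.feed(html)
    return parser.fields


page = """
<html><head>
<link rel="canonical" href="https://example.com/story">
<meta property="article:published_time" content="2025-01-10T14:00:00Z">
<meta property="article:section" content="AI">
<meta property="article:tag" content="openai">
<meta property="article:tag" content="gpt-store">
</head><body></body></html>
"""
print(extract_meta(page)["tags"])  # ['openai', 'gpt-store']
```

Fields that the page does not declare stay `None`, which mirrors the nullable types in the output schema.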

💡 Why it matters: news sites each have their own HTML structure. Writing per-site parsers is brittle and breaks every time a publisher redesigns their pages. This Actor uses readability-style scoring that works across any article-shaped page.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing extraction across news sites, blogs, and platforms.


⚙️ Input

| Input | Type | Default | Behavior |
|---|---|---|---|
| startUrls | array of URLs | required | One or more article URLs to extract. |
| maxItems | integer | 10 | Articles returned. Free plan caps at 10, paid plan at 1,000,000. |

Example: extract a single article.

{
  "startUrls": [
    {"url": "https://techcrunch.com/2025/01/10/openai-launches-gpt-store/"}
  ],
  "maxItems": 1
}

Example: batch extraction for media monitoring.

{
  "startUrls": [
    {"url": "https://www.theverge.com/2025/ai-coverage-1"},
    {"url": "https://www.wired.com/story/ai-agents-2026"},
    {"url": "https://arstechnica.com/ai/article"}
  ],
  "maxItems": 100
}

⚠️ Good to Know: works best on article-shaped pages (one headline, one author, one body). Homepages, category pages, and list views return thin extractions because there is no single article to score.


📊 Output

The schema below lists 23 data fields plus an error field, which carries a message such as "Timeout" on failure. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
|---|---|---|
| 🔗 url | string | "https://techcrunch.com/.../gpt-store/" |
| 🔁 canonicalUrl | string \| null | "https://techcrunch.com/.../gpt-store/" |
| 🏷️ title | string \| null | "OpenAI launches GPT Store" |
| 📑 subtitle | string \| null | "Available to Plus, Team, Enterprise" |
| 🧑 author | string \| null | "Kyle Wiggers" |
| 👥 authors | string[] | ["Kyle Wiggers"] |
| 📅 publishedAt | ISO 8601 \| null | "2025-01-10T14:00:00Z" |
| 🔁 modifiedAt | ISO 8601 \| null | "2025-01-10T16:30:00Z" |
| 🏢 siteName | string \| null | "TechCrunch" |
| 🗂️ section | string \| null | "AI" |
| 🏷️ tags | string[] | ["openai", "gpt-store"] |
| 🌍 language | string \| null | "en-US" |
| 📝 description | string \| null | "OpenAI rolled out the long-teased GPT Store..." |
| 🖼️ leadImage | string \| null | "https://.../og.jpg" |
| 🎨 images | string[] | ["https://...", "https://..."] |
| 📃 markdown | string | "# OpenAI launches GPT Store..." |
| 💬 text | string | plain text without markdown markers |
| 🧾 html | string | cleaned article HTML |
| 🔢 wordCount | number | 742 |
| ⏱️ readingTimeMinutes | number | 4 |
| 💰 hasPaywall | boolean | false |
| 🟢 httpStatus | number | 200 |
| 🕒 scrapedAt | ISO 8601 | "2026-04-21T12:00:00.000Z" |
| ❗ error | string \| null | "Timeout" on failure |

📦 Sample records
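As an illustration of the record shape, the sketch below assembles one record from the Example column of the schema and applies a simple downstream filter. It is not real Actor output. The elided URL is kept exactly as the schema shows it, and the body fields (markdown, text, html, images) are omitted for brevity. The helpers `is_usable` and `estimate_reading_minutes` are hypothetical, and the reading-time formula (200 words per minute, rounded up) is an assumption that happens to agree with the schema example.

```python
import math

# Illustrative record built from the schema's Example column, not real output.
sample_record = {
    "url": "https://techcrunch.com/.../gpt-store/",
    "canonicalUrl": "https://techcrunch.com/.../gpt-store/",
    "title": "OpenAI launches GPT Store",
    "subtitle": "Available to Plus, Team, Enterprise",
    "author": "Kyle Wiggers",
    "authors": ["Kyle Wiggers"],
    "publishedAt": "2025-01-10T14:00:00Z",
    "modifiedAt": "2025-01-10T16:30:00Z",
    "siteName": "TechCrunch",
    "section": "AI",
    "tags": ["openai", "gpt-store"],
    "language": "en-US",
    "wordCount": 742,
    "readingTimeMinutes": 4,
    "hasPaywall": False,
    "httpStatus": 200,
    "scrapedAt": "2026-04-21T12:00:00.000Z",
    "error": None,
}


def is_usable(record: dict) -> bool:
    """Keep records that fetched cleanly and are not flagged as paywalled."""
    return (
        record.get("error") is None
        and record.get("httpStatus") == 200
        and not record.get("hasPaywall", False)
    )


def estimate_reading_minutes(word_count: int, words_per_minute: int = 200) -> int:
    """Assumed formula: words / 200 wpm, rounded up (the Actor's own formula is not documented)."""
    return math.ceil(word_count / words_per_minute)


print(is_usable(sample_record))                              # True
print(estimate_reading_minutes(sample_record["wordCount"]))  # 4
```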


✨ Why choose this Actor

  • 🧠 DOM scoring. Readability-style extraction works across any article-shaped page without per-site rules.
  • 📊 23 fields. Authors, tags, section, dates, images, paywall, reading time, and canonical URL.
  • 💰 Paywall detection. Flags articles likely behind a paywall so you can filter them out.
  • ⚡ Fast. 10 articles in under 10 seconds with parallel fetching.
  • 🖼️ Image capture. Lead image plus every inline image URL in the article body.
  • 🚫 No credentials. Runs on any public article URL.
  • 🔌 Integrations. Plugs into RSS feeds, newsroom tools, and news datasets.

📊 Clean article text is the foundation of news summarization, sentiment analysis, and media monitoring. This Actor delivers it consistently without per-site parsers.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ Smart Article Extractor (this Actor) | $5 free credit, then pay-per-use | Any public article URL | Live per run | 23 metadata fields | ⚡ 2 min |
| Open-source readability libs | Free | Whatever you host | Your code | Whatever you build | 🐢 Days |
| News API services | $99+/month | Curated feeds | Real-time | Per-plan limits | ⏳ Hours |
| Paid media monitoring | $$$+/month | Managed sources | Real-time | Rich UI | 🕒 Variable |

Pick this Actor when you want article text from arbitrary URLs without maintaining your own extraction library.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the Smart Article Extractor page on the Apify Store.
  3. 🎯 Paste URLs. Add article URLs to the startUrls field and set maxItems.
  4. 🚀 Run it. Click Start and let the Actor extract the content.
  5. 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.


💼 Business use cases

📰 News Aggregation

  • Build custom news feeds across sources
  • Deduplicate stories across outlets
  • Normalize article structure for downstream apps
  • Feed summarization pipelines

🧠 AI & Summarization

  • Extract clean text for LLM summaries
  • Build news datasets for fine-tuning
  • Ground chatbots with current media
  • Power question-answering over news

📡 Media Monitoring

  • Track brand mentions across outlets
  • Monitor coverage of products or events
  • Capture executive quotes and bylines
  • Detect paywalled coverage to license

📚 Research & Archives

  • Build academic text corpora
  • Archive public journalism
  • Extract metadata for bibliographies
  • Preserve retracted or deleted articles


🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Empirical datasets for papers, thesis work, and coursework
  • Longitudinal studies tracking changes across snapshots
  • Reproducible research with cited, versioned data pulls
  • Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

  • Side projects, portfolio demos, and indie app launches
  • Data visualizations, dashboards, and infographics
  • Content research for bloggers, YouTubers, and podcasters
  • Hobbyist collections and personal trackers

🀝 Non-profit and civic

  • Transparency reporting and accountability projects
  • Advocacy campaigns backed by public-interest data
  • Community-run databases for local issues
  • Investigative journalism on public records

🧪 Experimentation

  • Prototype AI and machine-learning pipelines with real data
  • Validate product-market hypotheses before engineering spend
  • Train small domain-specific models on niche corpora
  • Test dashboard concepts with live input

❓ Frequently Asked Questions

🔌 Automating Smart Article Extractor

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟢 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • 📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Pair it with an RSS reader or Google News feed for continuous media monitoring.

🔌 Integrate with any app

Smart Article Extractor connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Post article summaries to channels
  • Airbyte - Pipe articles into your warehouse
  • GitHub - Trigger runs from commits
  • Google Drive - Export articles to Docs

You can also use webhooks to trigger summarization and alerting pipelines when new articles finish extracting.
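A webhook receiver mostly needs to pull the dataset ID out of the notification and hand it to the next pipeline stage. The sketch below assumes Apify's default webhook payload shape, with an `eventType` string and a `resource` object holding the run's `defaultDatasetId`. If a custom payload template is configured, the keys will differ. The function name is hypothetical.

```python
import json
from typing import Optional


def dataset_id_from_webhook(body: str) -> Optional[str]:
    """Return the default dataset ID for a succeeded run, else None.

    Assumes the default payload shape:
    {"eventType": "...", "resource": {"defaultDatasetId": "...", ...}, ...}
    """
    payload = json.loads(body)
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None  # ignore failed, aborted, or timed-out runs
    return (payload.get("resource") or {}).get("defaultDatasetId")
```

The returned ID can then be passed to `client.dataset(...)` to load the extracted articles for summarization or alerting.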


🔗 Recommended Actors

💡 Pro Tip: browse the complete ParseForge collection for more content-extraction tools.


🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any publisher, news outlet, or readability library. Only publicly accessible article URLs are processed. Respect the copyright and terms of service of every publisher you extract from.
