🧠 Smart Article Extractor

Pricing

from $3.99 / 1,000 results

Developer

Scrapio

Maintained by Community

Last modified: 2 days ago

🧠 Smart Article Extractor — News & Blog Scraper

One-paragraph summary: Smart Article Extractor is an Apify Actor that bulk-extracts clean article content — title, author, publish date, full text, summary, images, videos, in-body links and rich metadata — from any news site, blog or sitemap. Point it at a homepage / section / topic URL and it will discover, classify and extract every article automatically using a BFS crawler, sitemap scanning, and configurable URL-shape heuristics.

🚀 Why Choose Us?

Feature	Smart Article Extractor	Typical 1-URL article scraper
Bulk discovery (BFS crawler)	✅ Yes	❌ One URL at a time
Sitemap & robots.txt scanning	✅ Built-in	❌
Sub-domain / sub-path scoping	✅ Per Start URL	❌
`onlyNewArticles` cross-run dedup	✅ Per-domain & global	❌
Date filters (`dateFrom`, `onlyArticlesForLastDays`, `mustHaveDate`)	✅ All three	⚠️ Limited
Anti-block proxy fallback (none → DC → RES)	✅ Automatic	❌
Optional Playwright rendering	✅ Toggle	❌
Extend-output Python hook	✅ Inline snippet	❌
Live dataset push + state KVS	✅	⚠️

🔥 Key Features

📰 Clean article extraction — trafilatura + BeautifulSoup combo for high recall.
🌐 Bulk discovery — drop a homepage URL and the actor discovers articles via BFS.
🗺️ Sitemap & robots.txt — automatic Sitemap: parsing + common candidates.
🛡️ Smart proxy fallback — starts direct, then datacenter, then residential.
🎭 Headless browser mode — Playwright + Chromium for JS-heavy or protected sites.
🧠 Cross-run memory — onlyNewArticles and onlyNewArticlesPerDomain.
🪜 Depth / page / article caps — never over-crawl.
📅 Date filters — dateFrom, onlyArticlesForLastDays, mustHaveDate.
🛠️ extendOutputFunction — inject your own Python extend(soup, article, html).
💾 Save HTML / snapshots — full HTML in-record or as KVS link, PNG screenshots.
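
The bulk-discovery and cap features above can be pictured as a bounded breadth-first search. The following is a minimal sketch under stated assumptions, not the actor's own code: it walks an in-memory link graph instead of fetching pages, and `bounded_bfs`, `links`, `max_depth` and `max_pages` are hypothetical names standing in for the `maxDepth` and `maxPagesPerCrawl` inputs.

```python
from collections import deque


def bounded_bfs(start_url, links, max_depth=2, max_pages=50):
    """Visit pages breadth-first, honouring a depth cap and a page cap.

    `links` maps a URL to the URLs found on that page (illustrative
    stand-in for fetching and parsing). Returns (url, depth) pairs in
    visit order.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not enqueue links beyond the depth cap
        for nxt in links.get(url, []):
            if nxt not in seen:  # never queue the same URL twice
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical site structure used only for this demo.
site = {
    "/": ["/world", "/sport"],
    "/world": ["/world/a1", "/world/a2"],
    "/sport": ["/sport/a3", "/"],
    "/world/a1": ["/deep"],
}
print(bounded_bfs("/", site, max_depth=2, max_pages=4))
# [('/', 0), ('/world', 1), ('/sport', 1), ('/world/a1', 2)]
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2, so a low page cap tends to favour section pages close to the Start URL.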

📥 Input

Field	Type	Default	Description
`startUrls`	array	required	Homepages, sections, topic pages — used as crawl seeds.
`articleUrls`	array	`[]`	Direct article URLs to extract (no discovery needed).
`onlyNewArticles`	boolean	`false`	Skip URLs already seen in any previous run.
`onlyNewArticlesPerDomain`	boolean	`false`	Per-domain dedup memory.
`onlyInsideArticles`	boolean	`true`	Enqueue only same-domain links from articles.
`onlySubdomainArticles`	boolean	`false`	Restrict to URLs sharing the Start URL path prefix.
`enqueueFromArticles`	boolean	`true`	Discover further links inside extracted articles.
`crawlWholeSubdomain`	boolean	`true`	Treat any same-subdomain link as a category candidate.
`scanSitemaps`	boolean	`true`	Discover articles from `robots.txt` and common sitemap paths.
`useGoogleBotHeaders`	boolean	`true`	Identify as Googlebot.
`useBrowser`	boolean	`false`	Render with headless Chromium.
`scrollToBottom`	boolean	`false`	Force lazy-loaded content (browser mode only).
`mustHaveDate`	boolean	`false`	Drop articles with no detectable date.
`dateFrom`	string (ISO date)	—	Earliest article date.
`onlyArticlesForLastDays`	integer	—	Keep only articles published within the last N days.
`minWords`	integer	`150`	Reject articles with fewer words than this.
`maxDepth`	integer	`2`	Maximum BFS depth.
`maxPagesPerCrawl`	integer	`50`	Hard cap on fetched pages.
`maxArticlesPerCrawl`	integer	`25`	Hard cap on saved articles.
`maxArticlesPerStartUrl`	integer	`25`	Cap per Start URL.
`isUrlArticleDefinition`	object	see schema	URL-shape heuristic.
`linkSelector`	string	—	CSS selector restricting where links are collected from.
`pseudoUrls`	array	`[]`	Custom URL patterns for category pages.
`sitemapUrls`	array	`[]`	Explicit sitemap URLs (skip auto-discovery).
`saveHtml`	boolean	`false`	Include raw HTML in the dataset record.
`saveHtmlAsLink`	boolean	`false`	Save HTML to KVS and put a link in the record.
`saveSnapshots`	boolean	`false`	PNG screenshot (browser mode only).
`extendOutputFunction`	string	—	Python snippet — must define `extend(soup, article, html) -> dict`.
`proxyConfiguration`	object	`{useApifyProxy: false}`	Default = no proxy; auto-fallback to DC → RES if blocked.
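
For readers curious what `scanSitemaps` has to work with, the sketch below pulls the `Sitemap:` entries out of a robots.txt body using Python's standard `urllib.robotparser`. It is an illustration only: how the actor parses robots.txt internally is not documented here, `sitemaps_from_robots` is a hypothetical helper, and the "common sitemap paths" part of discovery is not covered.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the URLs declared on `Sitemap:` lines of a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() (Python 3.8+) returns None when no sitemap is declared.
    return parser.site_maps() or []


# Hypothetical robots.txt content for the demo.
example = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""
print(sitemaps_from_robots(example))
```

URLs found this way could also be passed explicitly through `sitemapUrls` when auto-discovery is not wanted.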

Example input:

{
  "startUrls": [{"url": "https://www.theguardian.com"}],
  "onlyArticlesForLastDays": 2,
  "minWords": 150,
  "maxArticlesPerCrawl": 5,
  "useGoogleBotHeaders": true,
  "scanSitemaps": true,
  "proxyConfiguration": {"useApifyProxy": false}
}

📤 Output

Each pushed record contains:

Field	Type	Description
`url`, `loadedUrl`	string	Original / resolved URL.
`domain`, `loadedDomain`	string	Bare host.
`referrer`, `startUrl`	string	Where the link was discovered.
`depth`	integer	BFS depth at time of crawl.
`title`, `softTitle`	string	Best-effort headline.
`date`	string (ISO)	Publication date if found.
`author`	array	Author URL(s) or name(s).
`publisher`, `copyright`, `lang`, `favicon`, `canonicalLink`	string	Site metadata.
`description`, `keywords`	string	Meta description / keywords.
`tags`	array	`article:tag` values.
`image`	string	Hero / OG image URL.
`videos`	array	`<video> / <iframe> / <source>` URLs.
`links`	array of `{text, href}`	Inner-body links.
`wordCount`	integer	Word count of the extracted text.
`text`	string	Cleaned article body.
`html`	string	Full HTML (only if `saveHtml` / `saveHtmlAsLink`).
`screenshotUrl`	string	KVS link (only if `saveSnapshots` + `useBrowser`).

Example output (truncated):

{
  "url": "https://www.theguardian.com/lifeandstyle/2026/may/21/how-often-should-you-go-to-the-toilet…",
  "domain": "theguardian.com",
  "title": "How often should you go to the toilet?…",
  "date": "2026-05-21T04:00:02.000Z",
  "author": ["https://www.theguardian.com/profile/sarahphillips"],
  "publisher": "the Guardian",
  "wordCount": 1620,
  "text": "Think balance, diversity and routine. \"Our gut is a complex machine,\" says…",
  "image": "https://i.guim.co.uk/img/media/…"
}

🚀 How to Use (Apify Console)

Log in at https://console.apify.com → Actors.
Open Smart Article Extractor.
Configure inputs (Start URLs, date filters, caps, proxy).
Click Start.
Watch logs in real time — the actor prints a per-article live feed.
Open the Output tab once the run completes.
Export to JSON / CSV / XLSX or wire to a webhook.

🤖 Use via API / MCP

curl -X POST "https://api.apify.com/v2/acts/<USERNAME>~smart-article-extractor/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{"url": "https://www.theguardian.com"}],
    "maxArticlesPerCrawl": 5,
    "onlyArticlesForLastDays": 2,
    "proxyConfiguration": {"useApifyProxy": false}
  }'

MCP-server tool name: smart-article-extractor.
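
The same call can be made from Python. Below is a hedged sketch using only the standard library; `build_run_request` is a hypothetical helper, and the `APIFY_USERNAME` environment variable is an assumption introduced here to fill the `<USERNAME>` placeholder. The network call only runs when both environment variables are set.

```python
import json
import os
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"


def build_run_request(username, token, actor_input):
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    actor_id = f"{username}~smart-article-extractor"
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same input as the curl example.
actor_input = {
    "startUrls": [{"url": "https://www.theguardian.com"}],
    "maxArticlesPerCrawl": 5,
    "onlyArticlesForLastDays": 2,
    "proxyConfiguration": {"useApifyProxy": False},
}

username = os.environ.get("APIFY_USERNAME")  # assumed variable name
token = os.environ.get("APIFY_TOKEN")
if __name__ == "__main__" and username and token:
    request = build_run_request(username, token, actor_input)
    with urllib.request.urlopen(request) as response:
        # The endpoint returns the dataset items as a JSON array.
        for item in json.load(response):
            print(item.get("date"), item.get("title"), item.get("url"))
```

The synchronous endpoint holds the connection open until the run finishes, so small caps such as `maxArticlesPerCrawl: 5` keep the request short.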

💡 Best Use Cases

📰 News monitoring on a topic / publisher
📊 NLP / sentiment / summarisation datasets
🏛️ Brand or competitor coverage tracking
🔍 SEO / SERP enrichment with full article text
📚 Knowledge-base construction for RAG / LLMs
🗞️ Press-clipping archives

💰 Pricing

Pay-per-usage. You only pay the Apify platform charges (compute time + proxies + transfer). No separate developer fee.

❓ Frequently Asked Questions

Q: Why are some articles skipped?
A: They failed at least one filter — date cut-off, mustHaveDate, minWords, or onlyNewArticles (already seen in a previous run). The log line states which one.
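
As a rough mental model of those filters, the sketch below returns the name of the first filter an article fails. It is an assumption-laden illustration, not the actor's code: `skip_reason` and `parse_iso` are hypothetical helpers, the order of checks is a guess, undated articles are assumed to pass the date filters unless `mustHaveDate` is set, and the `onlyNewArticles` check is left out.

```python
from datetime import datetime, timedelta, timezone


def parse_iso(value):
    """Parse an ISO 8601 date or timestamp such as 2026-05-21T04:00:02.000Z."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Treat date-only or naive values as UTC so comparisons are well defined.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def skip_reason(article, cfg, now):
    """Return the name of the first failed filter, or None to keep the article."""
    date = article.get("date")
    if date is None:
        if cfg.get("mustHaveDate"):
            return "mustHaveDate"
    else:
        published = parse_iso(date)
        if cfg.get("dateFrom") and published < parse_iso(cfg["dateFrom"]):
            return "dateFrom"
        days = cfg.get("onlyArticlesForLastDays")
        if days and published < now - timedelta(days=days):
            return "onlyArticlesForLastDays"
    if article.get("wordCount", 0) < cfg.get("minWords", 150):
        return "minWords"
    return None


now = datetime(2026, 5, 22, 12, 0, tzinfo=timezone.utc)
cfg = {"mustHaveDate": True, "onlyArticlesForLastDays": 2, "minWords": 150}
print(skip_reason({"date": "2026-05-21T04:00:02.000Z", "wordCount": 1620}, cfg, now))  # None
print(skip_reason({"date": "2026-05-10T09:00:00Z", "wordCount": 900}, cfg, now))  # onlyArticlesForLastDays
print(skip_reason({"wordCount": 900}, cfg, now))  # mustHaveDate
```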

Q: The site keeps blocking me.
A: Leave proxyConfiguration.useApifyProxy = false. The actor will auto-escalate to datacenter and then residential proxies, retrying up to 3 times on residential. If even that fails, enable useBrowser.
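
The escalation order can be sketched as a simple loop over proxy tiers. This is an illustration of the idea only: `fetch_with_fallback`, `Blocked` and the tier labels are hypothetical, the caller supplies the actual `fetch` function, and reading "up to 3 times" as three residential attempts in total is an interpretation.

```python
class Blocked(Exception):
    """Raised by the caller-supplied fetch function when a request is blocked."""


def fetch_with_fallback(url, fetch, residential_attempts=3):
    """Try tiers in order: direct (None), datacenter, then residential.

    `fetch(url, tier)` returns the page body or raises Blocked.
    Returns (tier, body) for the first attempt that succeeds.
    """
    tiers = [None, "DATACENTER"] + ["RESIDENTIAL"] * residential_attempts
    last_error = None
    for tier in tiers:
        try:
            return tier, fetch(url, tier)
        except Blocked as err:
            last_error = err  # remember the failure and escalate
    raise last_error


# Demo with a fake fetcher that only succeeds on the second residential try.
calls = []


def fake_fetch(url, tier):
    calls.append(tier)
    if tier != "RESIDENTIAL" or calls.count("RESIDENTIAL") < 2:
        raise Blocked(f"blocked via {tier}")
    return "<html>ok</html>"


print(fetch_with_fallback("https://example.com/a", fake_fetch))
print(calls)  # [None, 'DATACENTER', 'RESIDENTIAL', 'RESIDENTIAL']
```

Starting without a proxy keeps costs down on sites that do not block, which is presumably why the default is `useApifyProxy: false`.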

Q: Will it work for paywalled articles?
A: It honours soft-paywall workarounds (Googlebot UA) but does not bypass strict authentication.

Q: How do I keep cross-run memory?
A: Toggle onlyNewArticles or onlyNewArticlesPerDomain. The actor keeps state in a named KVS — if that fails (e.g. Store run with limited permissions) it falls back to the run-default store.

Q: Can I customise the output?
A: Yes — supply extendOutputFunction as a Python snippet defining extend(soup, article, html) -> dict. The returned dict is merged into the record.
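
A minimal example of such a snippet follows. It assumes that `article` behaves like a dict carrying the output fields (for example `wordCount`) and that `soup` is a BeautifulSoup object; the added field names `readingMinutes`, `firstHeading` and `htmlLength`, and the 200 words-per-minute reading speed, are made up for illustration.

```python
def extend(soup, article, html):
    """Return extra fields to merge into the article record."""
    word_count = article.get("wordCount") or 0
    # Ceiling division: assumed reading speed of 200 words per minute.
    reading_minutes = -(-word_count // 200)
    heading = soup.find("h1")  # first <h1> on the page, if any
    return {
        "readingMinutes": reading_minutes,
        "firstHeading": heading.get_text(strip=True) if heading else None,
        "htmlLength": len(html or ""),
    }
```

Choosing keys that do not clash with the built-in output fields avoids accidentally overwriting values such as `title` or `text` when the dict is merged.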

🛟 Support & Feedback

Use the Issues tab on the Actor page, or open a discussion on the Apify community forum. Pull requests are welcome.

⚖️ Cautions / legal

Data is collected only from publicly available sources.
Do not scrape private accounts or content behind authentication unless explicitly authorised.
The end user is responsible for legal compliance (GDPR, CCPA, anti-spam laws, target site ToS, etc.).
The actor honours robots.txt for sitemap discovery; it does not enforce robots.txt blocks on crawl URLs — please be a good citizen.

URL: https://apify.com/scrapio/smart-article-extractor
