Pricing
from $3.99 / 1,000 results
Smart Article Extractor: News & Blog Scraper
One-paragraph summary: Smart Article Extractor is an Apify Actor that bulk-extracts clean article content (title, author, publish date, full text, summary, images, videos, in-body links and rich metadata) from any news site, blog or sitemap. Point it at a homepage / section / topic URL and it will discover, classify and extract every article automatically using a BFS crawler, sitemap scanning, and configurable URL-shape heuristics.
Why Choose Us?
| Feature | Smart Article Extractor | Typical 1-URL article scraper |
|---|---|---|
| Bulk discovery (BFS crawler) | Yes | No, one URL at a time |
| Sitemap & robots.txt scanning | Yes, built-in | No |
| Sub-domain / sub-path scoping | Yes, per Start URL | No |
| `onlyNewArticles` cross-run dedup | Yes, per-domain & global | No |
| Date filters (`dateFrom`, `onlyArticlesForLastDays`, `mustHaveDate`) | Yes, all three | Limited |
| Anti-block proxy fallback (none → DC → RES) | Yes, automatic | No |
| Optional Playwright rendering | Yes, toggle | No |
| Extend-output Python hook | Yes, inline snippet | No |
| Live dataset push + state KVS | Yes | Limited |
Key Features
- Clean article extraction: trafilatura + BeautifulSoup combo for high recall.
- Bulk discovery: drop a homepage URL and the actor discovers articles via BFS.
- Sitemap & robots.txt: automatic `Sitemap:` parsing + common candidates.
- Smart proxy fallback: starts direct, then datacenter, then residential.
- Headless browser mode: Playwright + Chromium for JS-heavy or protected sites.
- Cross-run memory: `onlyNewArticles` and `onlyNewArticlesPerDomain`.
- Depth / page / article caps: never over-crawl.
- Date filters: `dateFrom`, `onlyArticlesForLastDays`, `mustHaveDate`.
- `extendOutputFunction`: inject your own Python `extend(soup, article, html)`.
- Save HTML / snapshots: full HTML in-record or as KVS link, PNG screenshots.
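The bulk-discovery and cap features above can be pictured as a bounded breadth-first search. The following is only a minimal sketch, not the actor's actual implementation: `bfs_discover`, the `get_links` callback and the toy `site` graph are hypothetical names, and the defaults simply mirror `maxDepth` and `maxPagesPerCrawl`.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_discover(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 50,
) -> list[tuple[str, int]]:
    """Visit pages breadth-first, honouring a depth cap and a page cap.

    `get_links` is a hypothetical callback returning the outgoing links of a
    page; a real crawler would fetch and parse the page here.
    """
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order: list[tuple[str, int]] = []

    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand links beyond the depth cap
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for a real site (illustrative only).
site = {
    "/": ["/world", "/sport"],
    "/world": ["/world/a1", "/world/a2"],
    "/sport": ["/sport/a3", "/"],
    "/world/a1": ["/world/a1/comments"],
}
pages = bfs_discover("/", lambda u: site.get(u, []), max_depth=2, max_pages=50)
```

With `max_depth=2`, the page `/world/a1/comments` (depth 3) is never queued, which is the sense in which the caps prevent over-crawling.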
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | Homepages, sections, topic pages; used as crawl seeds. |
| `articleUrls` | array | `[]` | Direct article URLs to extract (no discovery needed). |
| `onlyNewArticles` | boolean | `false` | Skip URLs already seen in any previous run. |
| `onlyNewArticlesPerDomain` | boolean | `false` | Per-domain dedup memory. |
| `onlyInsideArticles` | boolean | `true` | Enqueue only same-domain links from articles. |
| `onlySubdomainArticles` | boolean | `false` | Restrict to URLs sharing the Start URL path prefix. |
| `enqueueFromArticles` | boolean | `true` | Discover further links inside extracted articles. |
| `crawlWholeSubdomain` | boolean | `true` | Treat any same-subdomain link as a category candidate. |
| `scanSitemaps` | boolean | `true` | Discover articles from robots.txt and common sitemap paths. |
| `useGoogleBotHeaders` | boolean | `true` | Identify as Googlebot. |
| `useBrowser` | boolean | `false` | Render with headless Chromium. |
| `scrollToBottom` | boolean | `false` | Force lazy-loaded content (browser mode only). |
| `mustHaveDate` | boolean | `false` | Drop articles with no detectable date. |
| `dateFrom` | string (ISO date) | none | Earliest article date. |
| `onlyArticlesForLastDays` | integer | none | Convenience cut-off. |
| `minWords` | integer | `150` | Reject short articles. |
| `maxDepth` | integer | `2` | BFS depth. |
| `maxPagesPerCrawl` | integer | `50` | Hard cap on fetched pages. |
| `maxArticlesPerCrawl` | integer | `25` | Hard cap on saved articles. |
| `maxArticlesPerStartUrl` | integer | `25` | Cap per Start URL. |
| `isUrlArticleDefinition` | object | see schema | URL-shape heuristic. |
| `linkSelector` | string | none | CSS selector restricting where links are collected from. |
| `pseudoUrls` | array | `[]` | Custom URL patterns for category pages. |
| `sitemapUrls` | array | `[]` | Explicit sitemap URLs (skip auto-discovery). |
| `saveHtml` | boolean | `false` | Include raw HTML in the dataset record. |
| `saveHtmlAsLink` | boolean | `false` | Save HTML to KVS and put a link in the record. |
| `saveSnapshots` | boolean | `false` | PNG screenshot (browser mode only). |
| `extendOutputFunction` | string | none | Python snippet; must define `extend(soup, article, html) -> dict`. |
| `proxyConfiguration` | object | `{useApifyProxy: false}` | Default = no proxy; auto-fallback to DC → RES if blocked. |
Example input:
{"startUrls":[{"url":"https://www.theguardian.com"}],"onlyArticlesForLastDays":2,"minWords":150,"maxArticlesPerCrawl":5,"useGoogleBotHeaders":true,"scanSitemaps":true,"proxyConfiguration":{"useApifyProxy":false}}
Output
Each pushed record contains:
| Field | Type | Description |
|---|---|---|
| `url`, `loadedUrl` | string | Original / resolved URL. |
| `domain`, `loadedDomain` | string | Bare host. |
| `referrer`, `startUrl` | string | Where the link was discovered. |
| `depth` | integer | BFS depth at time of crawl. |
| `title`, `softTitle` | string | Best-effort headline. |
| `date` | string (ISO) | Publication date if found. |
| `author` | array | Author URL(s) or name(s). |
| `publisher`, `copyright`, `lang`, `favicon`, `canonicalLink` | string | Site metadata. |
| `description`, `keywords` | string | Meta description / keywords. |
| `tags` | array | `article:tag` values. |
| `image` | string | Hero / OG image URL. |
| `videos` | array | `<video>` / `<iframe>` / `<source>` URLs. |
| `links` | array of `{text, href}` | Inner-body links. |
| `wordCount` | integer | Word count of the extracted text. |
| `text` | string | Cleaned article body. |
| `html` | string | Full HTML (only if `saveHtml` / `saveHtmlAsLink`). |
| `screenshotUrl` | string | KVS link (only if `saveSnapshots` + `useBrowser`). |
Example output (truncated):
{"url":"https://www.theguardian.com/lifeandstyle/2026/may/21/how-often-should-you-go-to-the-toilet…","domain":"theguardian.com","title":"How often should you go to the toilet?…","date":"2026-05-21T04:00:02.000Z","author":["https://www.theguardian.com/profile/sarahphillips"],"publisher":"the Guardian","wordCount":1620,"text":"Think balance, diversity and routine. \"Our gut is a complex machine,\" says…","image":"https://i.guim.co.uk/img/media/…"}
How to Use (Apify Console)
- Log in at https://console.apify.com → Actors.
- Open Smart Article Extractor.
- Configure inputs (Start URLs, date filters, caps, proxy).
- Click Start.
- Watch logs in real time; the actor prints a per-article live feed.
- Open the Output tab once the run completes.
- Export to JSON / CSV / XLSX or wire to a webhook.
Use via API / MCP
curl -X POST "https://api.apify.com/v2/acts/<USERNAME>~smart-article-extractor/run-sync-get-dataset-items?token=$APIFY_TOKEN" -H "Content-Type: application/json" -d '{"startUrls": [{"url": "https://www.theguardian.com"}],"maxArticlesPerCrawl": 5,"onlyArticlesForLastDays": 2,"proxyConfiguration": {"useApifyProxy": false}}'
MCP-server tool name: smart-article-extractor.
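The same call can be prepared from Python using only the standard library. This is a minimal sketch that mirrors the curl command above: `build_run_request` is a hypothetical helper, and `example-user` and `my-token` are placeholders for your own Apify username and API token.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode


def build_run_request(username: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    actor_id = f"{username}~smart-article-extractor"
    url = (
        f"https://api.apify.com/v2/acts/{quote(actor_id, safe='~')}"
        f"/run-sync-get-dataset-items?{urlencode({'token': token})}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder credentials; replace with real values before sending.
request = build_run_request(
    "example-user",
    "my-token",
    {
        "startUrls": [{"url": "https://www.theguardian.com"}],
        "maxArticlesPerCrawl": 5,
        "onlyArticlesForLastDays": 2,
        "proxyConfiguration": {"useApifyProxy": False},
    },
)
```

To actually run the actor, pass the request to `urllib.request.urlopen(request)` and decode the response body with `json.load`; the response should be the list of dataset items.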
Best Use Cases
- News monitoring on a topic / publisher
- NLP / sentiment / summarisation datasets
- Brand or competitor coverage tracking
- SEO / SERP enrichment with full article text
- Knowledge-base construction for RAG / LLMs
- Press-clipping archives
Pricing
Pay-per-usage. You only pay the Apify platform charges (compute time + proxies + transfer). No separate developer fee.
Frequently Asked Questions
Q: Why are some articles skipped?
A: They failed at least one filter: the date cut-off, mustHaveDate, minWords, or onlyNewArticles (already seen in a previous run). The log line states which one.
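As a rough illustration of how those filters might combine, here is a minimal sketch. It is not the actor's code: `skip_reason` and `parse_date` are hypothetical helpers, the order of the checks is an assumption, and so is the choice to keep undated articles when `mustHaveDate` is off.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO date or timestamp (e.g. '2026-05-21T04:00:02.000Z') as UTC-aware."""
    if not value:
        return None
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def skip_reason(article: dict, cfg: dict, seen_urls: set, now: datetime) -> Optional[str]:
    """Return the name of the first failed filter, or None to keep the article."""
    date = parse_date(article.get("date"))
    if date is None:
        if cfg.get("mustHaveDate"):
            return "mustHaveDate"
    else:
        cutoffs = []
        if cfg.get("dateFrom"):
            cutoffs.append(parse_date(cfg["dateFrom"]))
        if cfg.get("onlyArticlesForLastDays"):
            cutoffs.append(now - timedelta(days=cfg["onlyArticlesForLastDays"]))
        # Assumption: when both date options are set, the stricter one wins.
        if cutoffs and date < max(cutoffs):
            return "date cut-off"
    if article.get("wordCount", 0) < cfg.get("minWords", 150):
        return "minWords"
    if cfg.get("onlyNewArticles") and article["url"] in seen_urls:
        return "onlyNewArticles"
    return None
```

Passing `now` in explicitly keeps the function deterministic, which makes it easy to replay a skipped article against a given configuration.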
Q: The site keeps blocking me.
A: Leave proxyConfiguration.useApifyProxy = false. The actor auto-escalates to datacenter and then residential proxies, retrying up to 3 times on residential. If even that fails, enable useBrowser.
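The escalation order can be sketched as a loop over proxy tiers. This is a minimal illustration only: `fetch_with_fallback` and the `fetch(url, tier)` callback are hypothetical, three residential attempts is one reading of "retry up to 3 times", and a single attempt for the other tiers is an assumption.

```python
from typing import Callable, Optional

# Tiers in escalation order with attempts per tier (attempt counts are assumed).
TIERS = [("none", 1), ("datacenter", 1), ("residential", 3)]


def fetch_with_fallback(url: str, fetch: Callable[[str, str], Optional[str]]):
    """Try each proxy tier in order; return (tier, body) or (None, None).

    `fetch(url, tier)` is a hypothetical callback that returns the page body,
    or None when the request was blocked.
    """
    for tier, attempts in TIERS:
        for _ in range(attempts):
            body = fetch(url, tier)
            if body is not None:
                return tier, body
    return None, None


# Simulated site that only answers on the second residential attempt.
calls = []


def fake_fetch(url, tier):
    calls.append(tier)
    return "<html>ok</html>" if calls.count("residential") == 2 else None


result = fetch_with_fallback("https://example.com/news", fake_fetch)
```

In this simulation the direct and datacenter attempts fail, so `result` names the residential tier; a `(None, None)` result corresponds to the point where the FAQ suggests enabling `useBrowser`.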
Q: Will it work for paywalled articles?
A: It can get past some soft paywalls by identifying as Googlebot (useGoogleBotHeaders), but it does not bypass strict authentication.
Q: How do I keep cross-run memory?
A: Toggle onlyNewArticles or onlyNewArticlesPerDomain. The actor keeps state in a named key-value store (KVS); if that fails (e.g. a Store run with limited permissions), it falls back to the run-default store.
Q: Can I customise the output?
A: Yes. Supply extendOutputFunction as a Python snippet defining extend(soup, article, html) -> dict. The returned dict is merged into the record.
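A snippet along the following lines could serve as a starting point. It is a sketch under assumptions: `soup` is taken to be a BeautifulSoup object for the page, `article` the record dict described in the Output section, and `html` the raw page source. The field names `readingTimeMinutes`, `h1` and `htmlLength` are invented for illustration.

```python
def extend(soup, article, html):
    """Return extra fields to merge into the article record."""
    # Prefer the extracted word count; fall back to counting words in the text.
    words = article.get("wordCount") or len((article.get("text") or "").split())
    h1 = soup.find("h1") if soup is not None else None
    return {
        # Hypothetical field names, chosen for illustration only.
        "readingTimeMinutes": max(1, round(words / 200)),
        "h1": h1.get_text(strip=True) if h1 is not None else None,
        "htmlLength": len(html or ""),
    }
```

Returning only new keys avoids overwriting fields the actor already produces, although how key collisions are resolved during the merge is not documented here.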
Support & Feedback
Use the Issues tab on the Actor page, or open a discussion on the Apify community forum. Pull requests are welcome.
Cautions / legal
- Data is collected only from publicly available sources.
- Do not scrape private accounts or content behind authentication unless explicitly authorised.
- The end user is responsible for legal compliance (GDPR, CCPA, anti-spam laws, target site ToS, etc.).
- The actor honours `robots.txt` for sitemap discovery; it does not enforce robots.txt blocks on crawl URLs, so please be a good citizen.
