URL: https://apify.com/scraper-engine/smart-article-extractor

⇱ 🧠 Smart Article Extractor Β· Apify


πŸ‘ 🧠 Smart Article Extractor avatar

🧠 Smart Article Extractor

Pricing: from $4.99 / 1,000 results
Rating: 0.0 (0)
Developer: Scraper Engine (maintained by Community)
Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 13 days ago

🧠 Smart Article Extractor β€” News & Blog Scraper

One-paragraph summary: Smart Article Extractor is an Apify Actor that bulk-extracts clean article content β€” title, author, publish date, full text, summary, images, videos, in-body links and rich metadata β€” from any news site, blog or sitemap. Point it at a homepage / section / topic URL and it will discover, classify and extract every article automatically using a BFS crawler, sitemap scanning, and configurable URL-shape heuristics.


🚀 Why Choose Us?

Feature | Smart Article Extractor | Typical 1-URL article scraper
Bulk discovery (BFS crawler) | ✅ Yes | ❌ One URL at a time
Sitemap & robots.txt scanning | ✅ Built-in | ❌
Sub-domain / sub-path scoping | ✅ Per Start URL | ❌
onlyNewArticles cross-run dedup | ✅ Per-domain & global | ❌
Date filters (dateFrom, lastDays, mustHaveDate) | ✅ All three | ⚠️ Limited
Anti-block proxy fallback (none → datacenter → residential) | ✅ Automatic | ❌
Optional Playwright rendering | ✅ Toggle | ❌
Extend-output Python hook | ✅ Inline snippet | ❌
Live dataset push + state KVS | ✅ | ⚠️

🔥 Key Features

  • 📰 Clean article extraction: trafilatura + BeautifulSoup combo for high recall.
  • 🌐 Bulk discovery: drop in a homepage URL and the actor discovers articles via BFS.
  • 🗺️ Sitemap & robots.txt: automatic parsing of Sitemap: lines, plus common candidate paths.
  • 🛡️ Smart proxy fallback: starts direct, then datacenter, then residential.
  • 🎭 Headless browser mode: Playwright + Chromium for JS-heavy or protected sites.
  • 🧠 Cross-run memory: onlyNewArticles and onlyNewArticlesPerDomain.
  • 🪜 Depth / page / article caps: hard limits keep a run from over-crawling.
  • 📅 Date filters: dateFrom, onlyArticlesForLastDays, mustHaveDate.
  • 🛠️ extendOutputFunction: inject your own Python extend(soup, article, html).
  • 💾 Save HTML / snapshots: full HTML in-record or as a KVS link, plus PNG screenshots.
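The BFS discovery and the depth / page caps listed above can be pictured with a small sketch. This is an illustration under stated assumptions, not the actor's own code: bfs_discover and get_links are hypothetical names, and a toy in-memory link graph stands in for real page fetches.

```python
from collections import deque


def bfs_discover(start_url, get_links, max_depth=2, max_pages=50):
    """Breadth-first discovery with a depth cap and a page cap.

    get_links is a hypothetical callable: url -> iterable of linked URLs.
    Returns a list of (url, depth) tuples in visit order.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # pages at the depth cap are visited but not expanded
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for real HTTP fetches.
graph = {
    "home": ["world", "sport"],
    "world": ["world/a1", "world/a2"],
    "sport": ["sport/a1", "home"],
    "world/a1": ["deep/page"],
}
pages = bfs_discover("home", lambda u: graph.get(u, []), max_depth=2, max_pages=50)
```

With max_depth=2, "deep/page" is never reached because it sits three links away from the seed; lowering max_pages stops the crawl earlier regardless of depth.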

📥 Input

Field | Type | Default | Description
startUrls | array | required | Homepages, sections, topic pages; used as crawl seeds.
articleUrls | array | [] | Direct article URLs to extract (no discovery needed).
onlyNewArticles | boolean | false | Skip URLs already seen in any previous run.
onlyNewArticlesPerDomain | boolean | false | Per-domain dedup memory.
onlyInsideArticles | boolean | true | Enqueue only same-domain links from articles.
onlySubdomainArticles | boolean | false | Restrict to URLs sharing the Start URL path prefix.
enqueueFromArticles | boolean | true | Discover further links inside extracted articles.
crawlWholeSubdomain | boolean | true | Treat any same-subdomain link as a category candidate.
scanSitemaps | boolean | true | Discover articles from robots.txt and common sitemap paths.
useGoogleBotHeaders | boolean | true | Identify as Googlebot.
useBrowser | boolean | false | Render with headless Chromium.
scrollToBottom | boolean | false | Scroll to force lazy-loaded content (browser mode only).
mustHaveDate | boolean | false | Drop articles with no detectable date.
dateFrom | string (ISO date) | (none) | Earliest article date.
onlyArticlesForLastDays | integer | (none) | Convenience cut-off: keep only articles from the last N days.
minWords | integer | 150 | Reject articles shorter than this many words.
maxDepth | integer | 2 | BFS depth.
maxPagesPerCrawl | integer | 50 | Hard cap on fetched pages.
maxArticlesPerCrawl | integer | 25 | Hard cap on saved articles.
maxArticlesPerStartUrl | integer | 25 | Cap per Start URL.
isUrlArticleDefinition | object | see schema | URL-shape heuristic.
linkSelector | string | (none) | CSS selector restricting where links are collected from.
pseudoUrls | array | [] | Custom URL patterns for category pages.
sitemapUrls | array | [] | Explicit sitemap URLs (skip auto-discovery).
saveHtml | boolean | false | Include raw HTML in the dataset record.
saveHtmlAsLink | boolean | false | Save HTML to KVS and put a link in the record.
saveSnapshots | boolean | false | PNG screenshot (browser mode only).
extendOutputFunction | string | (none) | Python snippet; must define extend(soup, article, html) -> dict.
proxyConfiguration | object | {useApifyProxy: false} | Default is no proxy; falls back automatically to datacenter, then residential, if blocked.

Example input:

{
  "startUrls": [{"url": "https://www.theguardian.com"}],
  "onlyArticlesForLastDays": 2,
  "minWords": 150,
  "maxArticlesPerCrawl": 5,
  "useGoogleBotHeaders": true,
  "scanSitemaps": true,
  "proxyConfiguration": {"useApifyProxy": false}
}
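With scanSitemaps enabled, the actor is described as reading Sitemap: lines from robots.txt and then trying common candidate paths. The following is a minimal sketch of that idea, not the actor's implementation: sitemap_candidates is a hypothetical helper, and the two fallback paths are assumptions, since the actor's real candidate list is not documented here.

```python
from urllib.parse import urljoin

# Assumed fallback paths; the actor's actual candidate list is not documented here.
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]


def sitemap_candidates(site_url, robots_txt):
    """Collect sitemap URLs declared in robots.txt, then append common guesses."""
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    for path in COMMON_SITEMAP_PATHS:
        candidate = urljoin(site_url, path)
        if candidate not in found:
            found.append(candidate)
    return found


robots = "User-agent: *\nDisallow: /private\nSitemap: https://example.com/news-sitemap.xml\n"
candidates = sitemap_candidates("https://example.com", robots)
```

Passing explicit sitemapUrls in the input skips this kind of auto-discovery entirely.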

📤 Output

Each pushed record contains:

Field | Type | Description
url, loadedUrl | string | Original / resolved URL.
domain, loadedDomain | string | Bare host.
referrer, startUrl | string | Where the link was discovered.
depth | integer | BFS depth at time of crawl.
title, softTitle | string | Best-effort headline.
date | string (ISO) | Publication date if found.
author | array | Author URL(s) or name(s).
publisher, copyright, lang, favicon, canonicalLink | string | Site metadata.
description, keywords | string | Meta description / keywords.
tags | array | article:tag values.
image | string | Hero / OG image URL.
videos | array | <video> / <iframe> / <source> URLs.
links | array of {text, href} | Inner-body links.
wordCount | integer | Word count of the extracted text.
text | string | Cleaned article body.
html | string | Full HTML (only if saveHtml / saveHtmlAsLink).
screenshotUrl | string | KVS link (only if saveSnapshots + useBrowser).

Example output (truncated):

{
  "url": "https://www.theguardian.com/lifeandstyle/2026/may/21/how-often-should-you-go-to-the-toilet…",
  "domain": "theguardian.com",
  "title": "How often should you go to the toilet?…",
  "date": "2026-05-21T04:00:02.000Z",
  "author": ["https://www.theguardian.com/profile/sarahphillips"],
  "publisher": "the Guardian",
  "wordCount": 1620,
  "text": "Think balance, diversity and routine. \"Our gut is a complex machine,\" says…",
  "image": "https://i.guim.co.uk/img/media/…"
}

🚀 How to Use (Apify Console)

  1. Log in at https://console.apify.com → Actors.
  2. Open Smart Article Extractor.
  3. Configure inputs (Start URLs, date filters, caps, proxy).
  4. Click Start.
  5. Watch logs in real time; the actor prints a per-article live feed.
  6. Open the Output tab once the run completes.
  7. Export to JSON / CSV / XLSX or wire to a webhook.

🤖 Use via API / MCP

curl -X POST "https://api.apify.com/v2/acts/<USERNAME>~smart-article-extractor/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{"url": "https://www.theguardian.com"}],
    "maxArticlesPerCrawl": 5,
    "onlyArticlesForLastDays": 2,
    "proxyConfiguration": {"useApifyProxy": false}
  }'
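The same call can be made from Python with only the standard library. This is a sketch rather than an official client: build_request and fetch_articles are hypothetical helper names, the username and token are supplied by the caller (the <USERNAME> placeholder above is left unfilled), and the 300-second timeout is an assumed value.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"


def build_request(username, token, run_input):
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{username}~smart-article-extractor/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_articles(username, token, run_input, timeout=300):
    """Send the request and decode the JSON response (performs a network call)."""
    request = build_request(username, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


run_input = {
    "startUrls": [{"url": "https://www.theguardian.com"}],
    "maxArticlesPerCrawl": 5,
    "onlyArticlesForLastDays": 2,
    "proxyConfiguration": {"useApifyProxy": False},
}
```

Calling fetch_articles(username, token, run_input) with real credentials should return the dataset items as a list of records shaped like the example output; building the request separately makes it easy to inspect the URL and payload before sending anything.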

MCP-server tool name: smart-article-extractor.


💡 Best Use Cases

  • 📰 News monitoring on a topic / publisher
  • 📊 NLP / sentiment / summarisation datasets
  • 🏛️ Brand or competitor coverage tracking
  • 🔍 SEO / SERP enrichment with full article text
  • 📚 Knowledge-base construction for RAG / LLMs
  • 🗞️ Press-clipping archives

💰 Pricing

Pay-per-usage. You only pay the Apify platform charges (compute time + proxies + transfer). No separate developer fee.


❓ Frequently Asked Questions

Q: Why are some articles skipped?
A: They failed at least one filter: the date cut-off, mustHaveDate, minWords, or onlyNewArticles (already seen in a previous run). The log line states which one.
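How the content filters might combine can be sketched as follows. This is an illustration, not the actor's code: skip_reason and parse_iso are hypothetical names, and the sketch assumes that an undated article passes the date filters unless mustHaveDate is set.

```python
from datetime import datetime, timedelta, timezone


def parse_iso(value):
    """Parse an ISO date or datetime string into a timezone-aware UTC datetime."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def skip_reason(article, *, now, date_from=None, last_days=None,
                must_have_date=False, min_words=150):
    """Return the name of the first failed filter, or None to keep the article."""
    date = parse_iso(article["date"]) if article.get("date") else None
    if date is None:
        # Assumption: undated articles are only dropped when mustHaveDate is on.
        if must_have_date:
            return "mustHaveDate"
    else:
        if date_from and date < parse_iso(date_from):
            return "dateFrom"
        if last_days is not None and date < now - timedelta(days=last_days):
            return "onlyArticlesForLastDays"
    if article.get("wordCount", 0) < min_words:
        return "minWords"
    return None


article = {"date": "2026-05-21T04:00:02.000Z", "wordCount": 1620}
now = datetime(2026, 5, 22, 12, 0, tzinfo=timezone.utc)
reason = skip_reason(article, now=now, last_days=2)
```

Returning the filter name rather than a bare boolean mirrors the log behaviour described above, where each skipped article is reported with the filter that rejected it.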

Q: The site keeps blocking me.
A: Leave proxyConfiguration.useApifyProxy = false. The actor will auto-escalate to datacenter and then residential proxies, retrying up to 3 times on residential. If even that fails, enable useBrowser.
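The escalation ladder can be pictured like this. The sketch is an assumption-laden illustration: BlockedError, fetch_with_fallback and the fetch callable are hypothetical, and "up to 3 times" is read here as three residential attempts in total.

```python
class BlockedError(Exception):
    """Raised by the (hypothetical) fetcher when a site blocks the request."""


def fetch_with_fallback(url, fetch, residential_attempts=3):
    """Try a direct request, then datacenter, then residential proxies.

    fetch is a hypothetical callable: (url, proxy_tier) -> HTML string.
    Returns (html, tier_used); tier_used is None for a direct request.
    """
    tiers = [None, "datacenter"] + ["residential"] * residential_attempts
    for tier in tiers:
        try:
            return fetch(url, tier), tier
        except BlockedError:
            continue  # escalate to the next tier in the ladder
    raise BlockedError(f"all proxy tiers failed for {url}")
```

Starting with a direct request keeps proxy costs at zero for sites that do not block, and the more expensive tiers are used only after a cheaper one has failed.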

Q: Will it work for paywalled articles?
A: It can get past some soft paywalls by identifying as Googlebot (useGoogleBotHeaders), but it does not bypass strict authentication.

Q: How do I keep cross-run memory?
A: Toggle onlyNewArticles or onlyNewArticlesPerDomain. The actor keeps state in a named KVS (key-value store); if that fails (e.g. a Store run with limited permissions), it falls back to the run-default store.
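One way such cross-run memory could be laid out is sketched below. The actor's real state format is not documented here, so everything in this block is an assumption: SeenUrls is a hypothetical class, and a JSON string stands in for the value stored in the KVS between runs.

```python
import json
from urllib.parse import urlsplit


class SeenUrls:
    """Hypothetical dedup memory: one global bucket, or one bucket per domain."""

    def __init__(self, saved=None, per_domain=False):
        self.per_domain = per_domain
        # saved maps bucket name -> list of URLs, as loaded from a previous run.
        self.buckets = {name: set(urls) for name, urls in (saved or {}).items()}

    def _bucket(self, url):
        if self.per_domain:
            return urlsplit(url).hostname or "unknown"
        return "global"

    def is_new(self, url):
        """Return True the first time a URL is seen, and remember it."""
        bucket = self.buckets.setdefault(self._bucket(url), set())
        if url in bucket:
            return False
        bucket.add(url)
        return True

    def dump(self):
        """Serialise the state to JSON so it can be stored between runs."""
        return json.dumps({name: sorted(urls) for name, urls in self.buckets.items()})


# First run: everything is new; the state is saved at the end.
run1 = SeenUrls()
first_seen = run1.is_new("https://example.com/a")
saved_state = run1.dump()

# Second run: state is restored, so the same URL is skipped.
run2 = SeenUrls(json.loads(saved_state))
seen_again = run2.is_new("https://example.com/a")
```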

Q: Can I customise the output?
A: Yes. Supply extendOutputFunction as a Python snippet defining extend(soup, article, html) -> dict. The returned dict is merged into the record.
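A minimal sketch of such a snippet follows. It assumes that soup is a BeautifulSoup object, that article is a dict carrying the output fields listed earlier, and that html is the raw page source; the added field names (readingTimeMinutes, subheadings, htmlLength) and the 200 words-per-minute figure are illustrative choices, not part of the actor.

```python
def extend(soup, article, html):
    """Hypothetical hook: add a reading-time estimate and the article's subheadings."""
    # Prefer the actor's own word count; fall back to counting words in the text.
    words = article.get("wordCount") or len((article.get("text") or "").split())
    # Assumes soup behaves like a BeautifulSoup object (find_all / get_text).
    subheadings = [h.get_text(strip=True) for h in soup.find_all("h2")]
    return {
        "readingTimeMinutes": max(1, (words + 199) // 200),  # ceil at ~200 wpm
        "subheadings": subheadings,
        "htmlLength": len(html or ""),
    }
```

Because the returned dict is merged into the record, picking key names that do not collide with built-in fields such as title or text avoids overwriting extracted values.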


🛟 Support & Feedback

Use the Issues tab on the Actor page, or open a discussion on the Apify community forum. Pull requests are welcome.


⚖️ Cautions / legal

  • Data is collected only from publicly available sources.
  • Do not scrape private accounts or content behind authentication unless explicitly authorised.
  • The end user is responsible for legal compliance (GDPR, CCPA, anti-spam laws, target site ToS, etc.).
  • The actor honours robots.txt for sitemap discovery; it does not enforce robots.txt blocks on crawl URLs, so please be a good citizen.
