URL: https://apify.com/api-empire/smart-article-extractor

Pricing

from $4.99 / 1,000 results

Developer

API Empire (maintained by Community)

🧠 Smart Article Extractor: News & Blog Scraper

One-paragraph summary: Smart Article Extractor is an Apify Actor that bulk-extracts clean article content (title, author, publish date, full text, summary, images, videos, in-body links and rich metadata) from any news site, blog or sitemap. Point it at a homepage, section or topic URL and it discovers, classifies and extracts every article automatically using a BFS crawler, sitemap scanning and configurable URL-shape heuristics.


🚀 Why Choose Us?

| Feature | Smart Article Extractor | Typical 1-URL article scraper |
| --- | --- | --- |
| Bulk discovery (BFS crawler) | ✅ Yes | ❌ One URL at a time |
| Sitemap & robots.txt scanning | ✅ Built-in | ❌ |
| Sub-domain / sub-path scoping | ✅ Per Start URL | ❌ |
| onlyNewArticles cross-run dedup | ✅ Per-domain & global | ❌ |
| Date filters (dateFrom, onlyArticlesForLastDays, mustHaveDate) | ✅ All three | ⚠️ Limited |
| Anti-block proxy fallback (none → DC → RES) | ✅ Automatic | ❌ |
| Optional Playwright rendering | ✅ Toggle | ❌ |
| Extend-output Python hook | ✅ Inline snippet | ❌ |
| Live dataset push + state KVS | ✅ | ⚠️ |

🔥 Key Features

  • 📰 Clean article extraction: trafilatura + BeautifulSoup combo for high recall.
  • 🌐 Bulk discovery: drop in a homepage URL and the actor discovers articles via BFS.
  • 🗺️ Sitemap & robots.txt: automatic parsing of Sitemap: directives, plus common candidate paths.
  • 🛡️ Smart proxy fallback: starts direct, then datacenter, then residential.
  • 🎭 Headless browser mode: Playwright + Chromium for JS-heavy or protected sites.
  • 🧠 Cross-run memory: onlyNewArticles and onlyNewArticlesPerDomain.
  • 🪜 Depth / page / article caps: never over-crawl.
  • 📅 Date filters: dateFrom, onlyArticlesForLastDays, mustHaveDate.
  • 🛠️ extendOutputFunction: inject your own Python extend(soup, article, html).
  • 💾 Save HTML / snapshots: full HTML in-record or as KVS link, PNG screenshots.
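
The bulk discovery and cap features above can be pictured as a breadth-first walk that stops at a depth and page budget. This is only a minimal sketch under stated assumptions, not the actor's actual code: the in-memory `LINKS` graph stands in for real page fetching, and `bfs_discover`, `max_depth` and `max_pages` are illustrative names modelled on the `maxDepth` and `maxPagesPerCrawl` inputs.

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages: URL -> outgoing links.
LINKS = {
    "https://example.com/": ["https://example.com/world", "https://example.com/tech"],
    "https://example.com/world": ["https://example.com/world/a1", "https://example.com/"],
    "https://example.com/tech": ["https://example.com/tech/b1"],
    "https://example.com/world/a1": ["https://example.com/world/a2"],
    "https://example.com/tech/b1": [],
    "https://example.com/world/a2": [],
}


def bfs_discover(start_url, links, max_depth=2, max_pages=50):
    """Visit pages breadth-first and return (url, depth) pairs in visit order."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # depth cap reached: do not enqueue this page's links
        for link in links.get(url, []):
            if link not in seen:  # never enqueue the same URL twice
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = bfs_discover("https://example.com/", LINKS, max_depth=2, max_pages=50)
```

With `max_depth=2`, the page `https://example.com/world/a2` sits at depth 3 and is never visited; lowering `max_pages` cuts the walk short regardless of depth.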

📥 Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | array | required | Homepages, sections, topic pages, used as crawl seeds. |
| articleUrls | array | [] | Direct article URLs to extract (no discovery needed). |
| onlyNewArticles | boolean | false | Skip URLs already seen in any previous run. |
| onlyNewArticlesPerDomain | boolean | false | Per-domain dedup memory. |
| onlyInsideArticles | boolean | true | Enqueue only same-domain links from articles. |
| onlySubdomainArticles | boolean | false | Restrict to URLs sharing the Start URL path prefix. |
| enqueueFromArticles | boolean | true | Discover further links inside extracted articles. |
| crawlWholeSubdomain | boolean | true | Treat any same-subdomain link as a category candidate. |
| scanSitemaps | boolean | true | Discover articles from robots.txt and common sitemap paths. |
| useGoogleBotHeaders | boolean | true | Identify as Googlebot. |
| useBrowser | boolean | false | Render with headless Chromium. |
| scrollToBottom | boolean | false | Force lazy-loaded content (browser mode only). |
| mustHaveDate | boolean | false | Drop articles with no detectable date. |
| dateFrom | string (ISO date) | (none) | Earliest article date. |
| onlyArticlesForLastDays | integer | (none) | Convenience cut-off. |
| minWords | integer | 150 | Reject short articles. |
| maxDepth | integer | 2 | BFS depth. |
| maxPagesPerCrawl | integer | 50 | Hard cap on fetched pages. |
| maxArticlesPerCrawl | integer | 25 | Hard cap on saved articles. |
| maxArticlesPerStartUrl | integer | 25 | Cap per Start URL. |
| isUrlArticleDefinition | object | see schema | URL-shape heuristic. |
| linkSelector | string | (none) | CSS selector restricting where links are collected from. |
| pseudoUrls | array | [] | Custom URL patterns for category pages. |
| sitemapUrls | array | [] | Explicit sitemap URLs (skip auto-discovery). |
| saveHtml | boolean | false | Include raw HTML in the dataset record. |
| saveHtmlAsLink | boolean | false | Save HTML to KVS and put a link in the record. |
| saveSnapshots | boolean | false | PNG screenshot (browser mode only). |
| extendOutputFunction | string | (none) | Python snippet; must define extend(soup, article, html) -> dict. |
| proxyConfiguration | object | {useApifyProxy: false} | Default = no proxy; auto-fallback to DC → RES if blocked. |
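
To illustrate how `scanSitemaps` and `sitemapUrls` might interact, here is a hedged sketch of sitemap discovery using Python's standard `urllib.robotparser`. The function name `sitemap_candidates` and the `COMMON_SITEMAP_PATHS` list are assumptions made for illustration; the actor's real candidate list and internals are not documented here.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Assumed fallback locations; the actor's actual "common candidates" may differ.
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]


def sitemap_candidates(base_url, robots_txt, explicit=None):
    """Return sitemap URLs to try for a site, most authoritative first."""
    if explicit:
        # Explicit sitemap URLs skip auto-discovery entirely.
        return list(explicit)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.site_maps() or []  # site_maps() returns None when empty
    fallback = [urljoin(base_url, path) for path in COMMON_SITEMAP_PATHS]
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    return list(dict.fromkeys(declared + fallback))


ROBOTS_TXT = """User-agent: *
Disallow: /private
Sitemap: https://example.com/news-sitemap.xml
"""

candidates = sitemap_candidates("https://example.com", ROBOTS_TXT)
```

Here `candidates` starts with the sitemap declared in robots.txt, followed by the assumed fallback paths resolved against the site root.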

Example input:

{
  "startUrls": [{"url": "https://www.theguardian.com"}],
  "onlyArticlesForLastDays": 2,
  "minWords": 150,
  "maxArticlesPerCrawl": 5,
  "useGoogleBotHeaders": true,
  "scanSitemaps": true,
  "proxyConfiguration": {"useApifyProxy": false}
}

📤 Output

Each pushed record contains:

| Field | Type | Description |
| --- | --- | --- |
| url, loadedUrl | string | Original / resolved URL. |
| domain, loadedDomain | string | Bare host. |
| referrer, startUrl | string | Where the link was discovered. |
| depth | integer | BFS depth at time of crawl. |
| title, softTitle | string | Best-effort headline. |
| date | string (ISO) | Publication date if found. |
| author | array | Author URL(s) or name(s). |
| publisher, copyright, lang, favicon, canonicalLink | string | Site metadata. |
| description, keywords | string | Meta description / keywords. |
| tags | array | article:tag values. |
| image | string | Hero / OG image URL. |
| videos | array | <video> / <iframe> / <source> URLs. |
| links | array of {text, href} | Inner-body links. |
| wordCount | integer | Word count of the extracted text. |
| text | string | Cleaned article body. |
| html | string | Full HTML (only if saveHtml / saveHtmlAsLink). |
| screenshotUrl | string | KVS link (only if saveSnapshots + useBrowser). |

Example output (truncated):

{
  "url": "https://www.theguardian.com/lifeandstyle/2026/may/21/how-often-should-you-go-to-the-toilet…",
  "domain": "theguardian.com",
  "title": "How often should you go to the toilet?…",
  "date": "2026-05-21T04:00:02.000Z",
  "author": ["https://www.theguardian.com/profile/sarahphillips"],
  "publisher": "the Guardian",
  "wordCount": 1620,
  "text": "Think balance, diversity and routine. \"Our gut is a complex machine,\" says…",
  "image": "https://i.guim.co.uk/img/media/…"
}

🚀 How to Use (Apify Console)

  1. Log in at https://console.apify.com → Actors.
  2. Open Smart Article Extractor.
  3. Configure inputs (Start URLs, date filters, caps, proxy).
  4. Click Start.
  5. Watch logs in real time: the actor prints a per-article live feed.
  6. Open the Output tab once the run completes.
  7. Export to JSON / CSV / XLSX or wire to a webhook.

🤖 Use via API / MCP

curl -X POST "https://api.apify.com/v2/acts/<USERNAME>~smart-article-extractor/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{"url": "https://www.theguardian.com"}],
    "maxArticlesPerCrawl": 5,
    "onlyArticlesForLastDays": 2,
    "proxyConfiguration": {"useApifyProxy": false}
  }'

MCP-server tool name: smart-article-extractor.
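
For callers who prefer Python over curl, the same request can be sketched with only the standard library. This is an illustrative equivalent, not an official client: `build_run_request` and `fetch_items` are hypothetical helper names, and the actor ID keeps the `<USERNAME>` placeholder from the curl example, which you would replace with the real owner name.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"

# Same payload as the curl example above.
RUN_INPUT = {
    "startUrls": [{"url": "https://www.theguardian.com"}],
    "maxArticlesPerCrawl": 5,
    "onlyArticlesForLastDays": 2,
    "proxyConfiguration": {"useApifyProxy": False},
}


def build_run_request(actor_id, token, run_input):
    """Build (but do not send) the POST request for a synchronous run."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_items(request, timeout=300):
    """Send the request and decode the JSON array of dataset items."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A typical call would be `fetch_items(build_run_request("<USERNAME>~smart-article-extractor", token, RUN_INPUT))`, with `token` read from the `APIFY_TOKEN` environment variable rather than hard-coded.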


💡 Best Use Cases

  • 📰 News monitoring on a topic / publisher
  • 📊 NLP / sentiment / summarisation datasets
  • 🏛️ Brand or competitor coverage tracking
  • 🔍 SEO / SERP enrichment with full article text
  • 📚 Knowledge-base construction for RAG / LLMs
  • 🗞️ Press-clipping archives

💰 Pricing

Pay-per-usage. You only pay the Apify platform charges (compute time + proxies + transfer). No separate developer fee.


❓ Frequently Asked Questions

Q: Why are some articles skipped?
A: They failed at least one filter: date cut-off, mustHaveDate, minWords, or onlyNewArticles (already seen in a previous run). The log line states which one.
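
The filter checks named in this answer can be sketched as a single function that reports the first filter an article fails. This is an assumption-laden illustration: the function name `skip_reason`, its parameter names, and the order in which filters are applied are all hypothetical, and the actor's real log wording may differ.

```python
from datetime import datetime, timedelta, timezone


def skip_reason(article, *, now, date_from=None, last_days=None,
                must_have_date=False, min_words=150, seen_urls=frozenset()):
    """Return the name of the first failed filter, or None to keep the article."""
    if article["url"] in seen_urls:
        return "onlyNewArticles"
    if article.get("wordCount", 0) < min_words:
        return "minWords"
    date = article.get("date")
    if date is None:
        # Undated articles are dropped only when mustHaveDate is enabled.
        return "mustHaveDate" if must_have_date else None
    # Accept the trailing "Z" used in the example output.
    published = datetime.fromisoformat(date.replace("Z", "+00:00"))
    if date_from is not None and published < date_from:
        return "dateFrom"
    if last_days is not None and published < now - timedelta(days=last_days):
        return "onlyArticlesForLastDays"
    return None


NOW = datetime(2026, 5, 22, tzinfo=timezone.utc)  # fixed clock for the example
fresh = {"url": "https://example.com/a", "wordCount": 1620,
         "date": "2026-05-21T04:00:02.000Z"}
stale = {"url": "https://example.com/b", "wordCount": 900,
         "date": "2026-05-01T00:00:00.000Z"}
```

With `last_days=2`, `skip_reason(fresh, now=NOW, last_days=2)` keeps the article while the same call on `stale` reports `onlyArticlesForLastDays`.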

Q: The site keeps blocking me.
A: Leave proxyConfiguration.useApifyProxy = false. The actor auto-escalates to datacenter and then residential proxies, retrying up to 3 times on residential. If that still fails, enable useBrowser.
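
The escalation described in this answer can be sketched as a ladder of proxy tiers tried in order. The code below is a simplified model, not the actor's implementation: `fetch_with_fallback`, `Blocked` and `PROXY_LADDER` are hypothetical names, the `fetch` callable is injected so the sketch runs without network access, and treating "up to 3 times" as three residential attempts is an assumption.

```python
# (tier, attempts). Only the residential figure comes from the FAQ text;
# one attempt each for "none" and "datacenter" is assumed.
PROXY_LADDER = [("none", 1), ("datacenter", 1), ("residential", 3)]


class Blocked(Exception):
    """Raised by a fetcher when the target site refuses the request."""


def fetch_with_fallback(url, fetch):
    """Try each proxy tier in order; return (body, tiers_tried)."""
    tried = []
    for tier, attempts in PROXY_LADDER:
        for _ in range(attempts):
            tried.append(tier)
            try:
                return fetch(url, tier), tried
            except Blocked:
                continue  # retry this tier, or fall through to the next one
    raise Blocked(f"{url} blocked on every tier after {len(tried)} attempts")
```

Passing the fetcher in as an argument keeps the escalation policy separate from the HTTP client, so the same ladder could wrap either a plain HTTP fetch or a browser-based one.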

Q: Will it work for paywalled articles?
A: It honours soft-paywall workarounds (Googlebot UA) but does not bypass strict authentication.

Q: How do I keep cross-run memory?
A: Toggle onlyNewArticles or onlyNewArticlesPerDomain. The actor keeps state in a named KVS (key-value store); if that fails (e.g. a Store run with limited permissions), it falls back to the run-default store.
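
One way to picture this cross-run memory is a set of seen URLs persisted under a key in the store. The sketch below uses a plain dict as a stand-in for the KVS; the class name `SeenUrls`, the key names, and the reading of "per domain" as one record per host are all assumptions, since the actor's storage layout is not documented here.

```python
from urllib.parse import urlsplit


class SeenUrls:
    """Remember extracted URLs in a dict-like store that outlives a run."""

    def __init__(self, store, per_domain=False):
        self.store = store          # stand-in for a named key-value store
        self.per_domain = per_domain

    def _key(self, url):
        # Assumed layout: one record per host, or a single global record.
        if self.per_domain:
            return f"seen-{urlsplit(url).hostname}"
        return "seen-global"

    def is_new(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        key = self._key(url)
        seen = set(self.store.get(key, []))
        if url in seen:
            return False
        seen.add(url)
        self.store[key] = sorted(seen)  # store a JSON-friendly list
        return True


kvs = {}                        # survives between the two simulated runs
first_run = SeenUrls(kvs)
second_run = SeenUrls(kvs)
```

Because both instances share `kvs`, a URL accepted by `first_run.is_new(...)` is rejected by `second_run.is_new(...)`, which is the behaviour `onlyNewArticles` describes.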

Q: Can I customise the output?
A: Yes. Supply extendOutputFunction as a Python snippet defining extend(soup, article, html) -> dict. The returned dict is merged into the record.
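
As a hedged example of such a snippet, the function below adds three extra fields. It assumes `soup` is a BeautifulSoup document (the actor lists BeautifulSoup among its extraction tools), `article` is the dict of already-extracted fields, and `html` is the raw page source; the output field names and the 1500-word threshold are invented for illustration.

```python
def extend(soup, article, html):
    """Return extra fields to merge into the article record."""
    # soup.find(name, attrs=...) is the standard BeautifulSoup lookup;
    # it returns None when no matching tag exists.
    twitter = soup.find("meta", attrs={"name": "twitter:site"})
    return {
        "twitterSite": twitter.get("content") if twitter else None,
        "htmlLength": len(html),
        # Arbitrary illustrative threshold, not an actor setting.
        "isLongRead": article.get("wordCount", 0) >= 1500,
    }
```

Keeping the function free of imports and side effects makes it easy to paste into the input field as a string and to test locally against saved HTML.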


🛟 Support & Feedback

Use the Issues tab on the Actor page, or open a discussion on the Apify community forum. Pull requests are welcome.


⚖️ Cautions / legal

  • Data is collected only from publicly available sources.
  • Do not scrape private accounts or content behind authentication unless explicitly authorised.
  • The end user is responsible for legal compliance (GDPR, CCPA, anti-spam laws, target site ToS, etc.).
  • The actor honours robots.txt for sitemap discovery; it does not enforce robots.txt blocks on crawl URLs, so please be a good citizen.
