Article Extractor & News Scraper

Pricing

$19.00/month + usage

Extract articles from any news site, blog, or webpage. Get title, full text, author, date, images & metadata using 7 extraction engines (including Newspaper4k, Trafilatura, and Goose3). Anti-bot bypass, proxy rotation, automatic fallback. Perfect for news monitoring, NLP datasets & content aggregation.

Rating

5.0

(2)

Developer

Web Harvester

Maintained by Community

Actor stats

Last modified: 6 months ago

✨ Features

Core Capabilities

🔍 7 Specialized Extraction Engines — Choose from Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, and JusText
🌐 Universal Website Compatibility — Works with news sites, blogs, magazines, and any article-based content
📊 Complete Content Extraction — Captures title, description, full text, authors, publication date, images, keywords, and metadata
🔄 Smart Fallback System — Automatically tries alternative extractors if the primary one fails

Anti-Blocking Technology

🎭 Browser Fingerprint Generation — Uses browserforge for realistic browser headers
🔀 Proxy Rotation — Automatic proxy rotation with support for residential proxies
⏱️ Intelligent Rate Limiting — Domain-specific delays and concurrency control
☁️ CloudScraper Integration — Bypasses Cloudflare and similar protections
📦 Google Cache Fallback — Retrieves content from Google's cache when direct access fails

Output Options

📝 Plain Text — Clean, extracted article text
🔖 Article HTML — Preserved formatting with links and media
📄 Full Page HTML — Complete webpage source for custom processing
📋 Structured JSON — All metadata in a standardized format

🎯 Use Cases

Industry	Application
Media Monitoring	Track news coverage, brand mentions, and competitor activity
Research & Academia	Collect data for NLP, sentiment analysis, and content studies
Content Aggregation	Build news feeds, curated content platforms, and newsletters
SEO Analysis	Analyze competitor content, keywords, and publishing patterns
Market Intelligence	Monitor industry news, trends, and market developments
Web Archiving	Preserve article content with full metadata
AI/ML Training	Generate training datasets for language models

⚙️ How It Works

graph LR
 A[Input URLs] --> B[Fetch Pages]
 B --> C{Anti-Bot Check}
 C -->|Blocked| D[Rotate Proxy/Headers]
 D --> B
 C -->|Success| E[Extract Content]
 E --> F{Extraction OK?}
 F -->|No| G[Try Fallback Engine]
 G --> E
 F -->|Yes| H[Output JSON]

Input Processing — Accepts a list of article URLs
Smart Fetching — Uses randomized browser headers and proxy rotation
Anti-Bot Evasion — Detects and bypasses blocking with CloudScraper and fingerprint rotation
Content Extraction — Applies the selected extraction engine
Fallback Logic — Automatically tries alternative engines if extraction fails
Output Generation — Returns structured JSON with all extracted data
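The extraction and fallback steps above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual code: the `Extractor` signature, the `extract_with_fallback` helper, and the two toy engines are hypothetical stand-ins, and only the output keys (`extractorEngine`, `fallbackUsed`, `originalExtractor`) mirror fields documented for this Actor.

```python
from typing import Callable, Dict, Optional

# Hypothetical extractor signature: takes page HTML, returns article text or None.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(html: str, engines: Dict[str, Extractor], primary: str) -> dict:
    """Try the primary engine first, then the remaining engines in insertion order."""
    order = [primary] + [name for name in engines if name != primary]
    for name in order:
        try:
            text = engines[name](html)
        except Exception:
            text = None  # a crashing engine counts as a failed extraction
        if text:
            result = {
                "text": text,
                "extractorEngine": name,
                "fallbackUsed": name != primary,
            }
            if name != primary:
                result["originalExtractor"] = primary
            return result
    # Every engine failed to produce text.
    return {"text": None, "extractorEngine": None, "fallbackUsed": False}


# Toy engines used only for this demonstration.
def always_fails(html: str) -> Optional[str]:
    return None


def strip_paragraph_tags(html: str) -> Optional[str]:
    return html.replace("<p>", "").replace("</p>", "").strip() or None


demo = extract_with_fallback(
    "<p>Hello</p>",
    {"newspaper4k": always_fails, "trafilatura": strip_paragraph_tags},
    primary="newspaper4k",
)
print(demo)
```

In this sketch the primary engine returns nothing, so the second engine supplies the text and the result records which engine was originally requested.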

📊 Extraction Engines Comparison

Engine	Best For	Speed	Metadata	NLP Features	Special Capabilities
Newspaper4k	General news	⚡⚡⚡	✅ Full	✅ Yes	Summary, keywords, NER
Trafilatura	News & blogs	⚡⚡⚡⚡	✅ Full	❌ No	Language detection, categories
Boilerpy3	Simple articles	⚡⚡⚡⚡⚡	⚠️ Basic	❌ No	Text density metrics
News-Please	Rich metadata	⚡⚡	✅ Full	❌ No	Multiple fallback methods
Goose3	Image extraction	⚡⚡⚡	✅ Full	❌ No	Top image detection
Article Parser	HTML/Markdown	⚡⚡⚡	⚠️ Basic	❌ No	Multiple output formats
JusText	Boilerplate removal	⚡⚡⚡⚡	⚠️ Basic	❌ No	Language-aware filtering

🚀 Quick Start

Run on Apify Platform

{
    "startUrls": [
        "https://www.nytimes.com/2024/01/15/technology/ai-developments.html",
        "https://www.theguardian.com/world/2024/jan/15/breaking-news"
    ],
    "extractorEngine": "newspaper4k"
}

Run Locally with Apify CLI

# Install Apify CLI
npm install -g apify-cli
# Clone and run
apify pull article-extractor-news-scraper
cd article-extractor-news-scraper
apify run --input='{"startUrls": ["https://example.com/article"]}'

Call via API

curl -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~article-extractor-news-scraper/runs" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": ["https://www.bbc.com/news/world-12345"],
    "extractorEngine": "newspaper4k"
  }'

📝 Input Configuration

Parameter	Type	Default	Description
`startUrls`	array	required	List of article URLs to extract
`extractorEngine`	string	`newspaper4k`	Extraction engine to use
`useFallbackExtractors`	boolean	`true`	Try alternative engines if primary fails
`saveHtml`	boolean	`false`	Include full page HTML in output
`saveArticleHtml`	boolean	`false`	Include cleaned article HTML
`maxRetries`	integer	`15`	Retry attempts for failed requests
`useHeaderGenerator`	boolean	`true`	Generate realistic browser headers
`headerGeneratorOptions`	object	`{}`	Browser/device emulation settings
`customHeaders`	object	`{}`	Additional HTTP headers
`proxyConfiguration`	object	residential	Proxy settings
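As a rough illustration of how the defaults in the table combine with user overrides, the sketch below builds a run input dictionary. The `build_input` helper and the `DEFAULTS` constant are hypothetical names introduced here; the parameter names and default values are taken from the table above. `proxyConfiguration` is left out of the defaults because the table only describes it as "residential" without giving an exact object.

```python
import copy

# Defaults copied from the input configuration table.
DEFAULTS = {
    "extractorEngine": "newspaper4k",
    "useFallbackExtractors": True,
    "saveHtml": False,
    "saveArticleHtml": False,
    "maxRetries": 15,
    "useHeaderGenerator": True,
    "headerGeneratorOptions": {},
    "customHeaders": {},
}

# Parameters that are accepted but have no literal default in this sketch.
OPTIONAL_WITHOUT_DEFAULT = {"proxyConfiguration"}


def build_input(start_urls, **overrides):
    """Return a run input dict: table defaults, then overrides, then startUrls."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_WITHOUT_DEFAULT
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    run_input = copy.deepcopy(DEFAULTS)  # avoid sharing the nested default dicts
    run_input.update(overrides)
    run_input["startUrls"] = list(start_urls)
    return run_input


example = build_input(["https://example.com/article"], maxRetries=5)
print(example)
```

Rejecting unknown keys early is a design choice of this sketch; it catches misspelled parameter names before a run is started.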

Full Input Example

{
    "startUrls": [
        "https://www.nytimes.com/2024/01/15/world/article.html",
        "https://www.theguardian.com/world/2024/jan/15/story",
        "https://www.bbc.com/news/world-12345678"
    ],
    "extractorEngine": "newspaper4k",
    "useFallbackExtractors": true,
    "saveHtml": false,
    "saveArticleHtml": true,
    "maxRetries": 15,
    "useHeaderGenerator": true,
    "headerGeneratorOptions": {
        "browsers": ["chrome", "firefox", "safari", "edge"],
        "devices": ["desktop"]
    },
    "customHeaders": {},
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
    }
}

📤 Output Format

Each extracted article produces a JSON object with the following fields:

Common Fields (All Engines)

Field	Type	Description
`url`	string	Original article URL
`title`	string	Article headline
`text`	string	Full article text (cleaned)
`sourceDomain`	string	Website domain
`extractorEngine`	string	Engine used for extraction
`extractedAt`	string	ISO 8601 timestamp

Extended Fields (Engine-Dependent)

Field	Type	Available In
`description`	string	newspaper4k, goose3, news-please
`author`	array	newspaper4k, news-please
`publishedDate`	string	newspaper4k, trafilatura, news-please
`image`	string	newspaper4k, goose3, news-please
`keywords`	array	newspaper4k, goose3
`summary`	string	newspaper4k
`language`	string	newspaper4k, trafilatura, justext
`categories`	array	trafilatura
`tags`	array	trafilatura
`allImages`	array	newspaper4k
`metaData`	object	newspaper4k
`siteName`	string	newspaper4k
`favicon`	string	newspaper4k
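Since field availability depends on the engine, it can help to check which engines cover a given set of fields before choosing `extractorEngine`. The sketch below encodes the table above as a lookup; `FIELD_SUPPORT` and `engines_supporting` are hypothetical names, and the engine identifiers are spelled as they appear in the table.

```python
# Field -> engines, transcribed from the "Extended Fields" table.
FIELD_SUPPORT = {
    "description": {"newspaper4k", "goose3", "news-please"},
    "author": {"newspaper4k", "news-please"},
    "publishedDate": {"newspaper4k", "trafilatura", "news-please"},
    "image": {"newspaper4k", "goose3", "news-please"},
    "keywords": {"newspaper4k", "goose3"},
    "summary": {"newspaper4k"},
    "language": {"newspaper4k", "trafilatura", "justext"},
    "categories": {"trafilatura"},
    "tags": {"trafilatura"},
    "allImages": {"newspaper4k"},
    "metaData": {"newspaper4k"},
    "siteName": {"newspaper4k"},
    "favicon": {"newspaper4k"},
}


def engines_supporting(fields):
    """Return the sorted engines that provide every requested extended field."""
    candidates = None
    for field in fields:
        supported = FIELD_SUPPORT[field]  # raises KeyError for an unknown field
        candidates = set(supported) if candidates is None else candidates & supported
    return sorted(candidates or [])


print(engines_supporting(["author", "publishedDate"]))
print(engines_supporting(["summary", "categories"]))
```

An empty result, as for the second call, means no single engine in the table provides all requested fields, so those fields would need to come from separate runs.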

Metadata Fields

Field	Type	Description
`fallbackUsed`	boolean	Whether a fallback engine was used
`originalExtractor`	string	Originally requested engine (if fallback used)
`fetchedFromCache`	boolean	Whether content was fetched from Google Cache
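The metadata fields make it possible to audit how often a run had to rely on fallbacks. A minimal sketch, assuming the dataset items are available as a list of dictionaries with the field names listed above; the `summarize_items` helper and the sample items are made up for illustration.

```python
from collections import Counter


def summarize_items(items):
    """Count fallback and cache usage across extracted items."""
    stats = Counter()
    for item in items:
        stats["total"] += 1
        if item.get("fallbackUsed"):
            stats["fallback"] += 1
            # Track which requested engine had to be replaced.
            origin = item.get("originalExtractor", "unknown")
            stats[f"fallback_from:{origin}"] += 1
        if item.get("fetchedFromCache"):
            stats["from_cache"] += 1
    return dict(stats)


# Illustrative items only; real items carry many more fields.
sample_items = [
    {"url": "https://example.com/a", "fallbackUsed": False, "fetchedFromCache": False},
    {"url": "https://example.com/b", "fallbackUsed": True,
     "originalExtractor": "newspaper4k", "fetchedFromCache": False},
    {"url": "https://example.com/c", "fallbackUsed": False, "fetchedFromCache": True},
]

summary = summarize_items(sample_items)
print(summary)
```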

📋 Example Outputs

🛡️ Anti-Blocking Features

This Actor includes advanced anti-blocking technology to maximize success rates:

Browser Fingerprint Generation

Uses browserforge to generate realistic browser fingerprints including:

Chrome, Firefox, Safari, and Edge user agents
Proper sec-ch-ua client hints
Consistent platform and viewport data
Session-based fingerprint persistence

Proxy Rotation

Automatic proxy rotation on 403/429 errors
Support for residential, datacenter, and custom proxies
Domain-specific proxy strategies
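The rotation behaviour described above can be sketched as a retry loop that switches proxy whenever a 403 or 429 comes back. This is an assumption-laden illustration rather than the Actor's implementation: `fetch_with_rotation`, the injected `fetch` callable, and the `.invalid` proxy URLs are hypothetical, and no real network request is made.

```python
from itertools import cycle

BLOCK_STATUSES = {403, 429}  # statuses the text above names as rotation triggers


def fetch_with_rotation(url, proxies, fetch, max_retries=15):
    """Call fetch(url, proxy) -> (status, body), rotating proxy on block statuses."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)
    proxy = next(pool)
    for _ in range(max_retries + 1):  # first attempt plus max_retries retries
        status, body = fetch(url, proxy)
        if status not in BLOCK_STATUSES:
            return status, body, proxy
        proxy = next(pool)  # blocked: move on to the next proxy
    raise RuntimeError(f"still blocked after {max_retries} retries: {url}")


# Fake fetcher for the demo: the first proxy is always blocked.
def fake_fetch(url, proxy):
    if proxy == "http://proxy-a.invalid":
        return 403, ""
    return 200, "<html>ok</html>"


result = fetch_with_rotation(
    "https://example.com/article",
    ["http://proxy-a.invalid", "http://proxy-b.invalid"],
    fake_fetch,
)
print(result)
```

Passing the fetch function in as an argument keeps the rotation logic testable without a network connection.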

Intelligent Rate Limiting

Per-domain concurrency control
Adaptive delays based on site response
Strict mode for heavily protected sites
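To make the per-domain idea concrete, here is a minimal sketch of a limiter that enforces a fixed delay between requests to the same domain. It covers only the fixed-delay part; the adaptive delays and strict mode mentioned above are not modelled. The `DomainRateLimiter` class and its parameters are hypothetical, and the clock and sleep functions are injectable so the example runs instantly.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum delay between consecutive requests to the same domain."""

    def __init__(self, default_delay=1.0, per_domain=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay
        self.per_domain = per_domain or {}  # optional domain -> delay overrides
        self.clock = clock
        self.sleep = sleep
        self._next_allowed = {}  # domain -> earliest time of the next request

    def wait(self, url):
        """Block until the URL's domain may be requested; return seconds waited."""
        domain = urlparse(url).netloc
        delay = self.per_domain.get(domain, self.default_delay)
        now = self.clock()
        ready_at = self._next_allowed.get(domain, now)
        pause = max(0.0, ready_at - now)
        if pause:
            self.sleep(pause)
        self._next_allowed[domain] = max(now, ready_at) + delay
        return pause


# Demo with a frozen clock and a recording sleep, so nothing actually blocks.
slept = []
limiter = DomainRateLimiter(default_delay=1.0, clock=lambda: 100.0, sleep=slept.append)
first = limiter.wait("https://a.example/1")
second = limiter.wait("https://a.example/2")
other = limiter.wait("https://b.example/1")
print(first, second, other, slept)
```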

CloudScraper Integration

Bypasses Cloudflare browser verification
Handles JavaScript challenges
Automatic cookie management

Google Cache Fallback

When direct access fails after all retries, the Actor attempts to retrieve content from Google's cache as a last resort.

⚡ Performance Tips

For Maximum Speed

{
    "extractorEngine": "boilerpy3",
    "maxRetries": 5,
    "useFallbackExtractors": false
}

For Maximum Success Rate

{
    "extractorEngine": "newspaper4k",
    "maxRetries": 15,
    "useFallbackExtractors": true,
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
    }
}

For Rich Metadata

{
    "extractorEngine": "newspaper4k",
    "saveArticleHtml": true,
    "useFallbackExtractors": true
}

🔌 Integrations

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("YOUR_API_TOKEN")

run_input = {
    "startUrls": ["https://www.example.com/article"],
    "extractorEngine": "newspaper4k",
}

# Start the Actor and wait for the run to finish
run = client.actor("YOUR_USERNAME/article-extractor-news-scraper").call(run_input=run_input)

# Iterate over the extracted articles in the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"Title: {item['title']}")
    print(f"Text: {item['text'][:200]}...")

JavaScript/Node.js

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor and wait for the run to finish
const run = await client.actor('YOUR_USERNAME/article-extractor-news-scraper').call({
    startUrls: ['https://www.example.com/article'],
    extractorEngine: 'newspaper4k',
});

// Fetch the extracted articles from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.log(`Title: ${item.title}`);
    console.log(`Text: ${item.text.substring(0, 200)}...`);
});

Webhooks

Configure webhooks to receive results automatically:

{
    "webhooks": [
        {
            "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
            "requestUrl": "https://your-server.com/webhook"
        }
    ]
}
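On the receiving side, the webhook request body has to be parsed to find the dataset that holds the results. The sketch below assumes Apify's default webhook payload, in which `eventType` names the event and `resource` holds the run object including `defaultDatasetId`; if a custom payload template is configured, the keys will differ. The `dataset_id_from_webhook` helper and the sample body are hypothetical.

```python
import json


def dataset_id_from_webhook(body: bytes):
    """Return the run's default dataset ID for a succeeded run, else None."""
    payload = json.loads(body)
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None  # ignore other events
    resource = payload.get("resource") or {}
    return resource.get("defaultDatasetId")


# Made-up request body shaped like the assumed default payload.
sample_body = json.dumps({
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "eventData": {"actorId": "ACTOR_ID", "actorRunId": "RUN_ID"},
    "resource": {"id": "RUN_ID", "defaultDatasetId": "DATASET_ID"},
}).encode("utf-8")

dataset_id = dataset_id_from_webhook(sample_body)
print(dataset_id)
```

The returned ID can then be passed to the dataset client shown in the Python integration example to read the extracted articles.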

Zapier / Make (Integromat)

Use the Apify integration in Zapier or Make to connect extracted articles to:

Google Sheets
Notion databases
Slack notifications
Email newsletters
CRM systems

🔧 Troubleshooting

Common Issues

Issue	Cause	Solution
Empty text output	Anti-bot blocking	Enable residential proxies, reduce concurrency
403/429 errors	Rate limiting	Increase `maxRetries` and use residential proxies
Timeout errors	Slow server response	Increase timeout, try Google Cache
Missing metadata	Engine limitation	Switch to a different extraction engine
Garbled text	Encoding issues	Try trafilatura or newspaper4k

Reporting Issues

If you encounter persistent issues:

Check if the URL works in a regular browser
Try different extraction engines
Open an issue with:
- The problematic URL
- Your input configuration

❓ Frequently Asked Questions

📝 Changelog

v1.0.0 (December 2025)

✨ Initial public release
🔍 7 extraction engines: Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, JusText
🛡️ Advanced anti-blocking with browserforge fingerprinting
🔄 Automatic fallback extraction
☁️ Google Cache fallback for blocked pages
📊 Multiple dataset views (Overview, Content, Metadata)
⚙️ Configurable concurrency and retry settings

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Built with ❤️ for the data extraction community

Keywords: article extractor, news scraper, web scraping, content extraction, newspaper4k, trafilatura, apify actor, python scraper, text extraction, metadata extraction, NLP, news monitoring, content aggregation

URL: https://apify.com/web.harvester/article-extractor-news-scraper
