Pricing
$19.00/month + usage
Article Extractor & News Scraper
Extract articles from any news site, blog, or webpage. Get title, full text, author, date, images, and metadata using 7 extraction engines (including Newspaper4k, Trafilatura, and Goose3). Includes anti-bot bypass, proxy rotation, and automatic fallback. Well suited to news monitoring, NLP datasets, and content aggregation.
Rating: 5.0 (2)
Actor stats: 3 bookmarked, 50 total users, 0 monthly active users, last modified 6 months ago
Apify Actor | Python 3.12 | License: MIT
Extract articles, news content, and blog posts from any website. Get clean, structured data with title, full text, authors, publication date, images, keywords, and metadata, powered by 7 specialized extraction engines.
Table of Contents
- Features
- Use Cases
- How It Works
- Extraction Engines Comparison
- Quick Start
- Input Configuration
- Output Format
- Example Outputs
- Anti-Blocking Features
- Performance Tips
- Integrations
- Troubleshooting
- FAQ
- Changelog
Features
Core Capabilities
- 7 Specialized Extraction Engines: Choose from Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, and JusText
- Universal Website Compatibility: Works with news sites, blogs, magazines, and any article-based content
- Complete Content Extraction: Captures title, description, full text, authors, publication date, images, keywords, and metadata
- Smart Fallback System: Automatically tries alternative extractors if the primary one fails
Anti-Blocking Technology
- Browser Fingerprint Generation: Uses browserforge for realistic browser headers
- Proxy Rotation: Automatic proxy rotation with support for residential proxies
- Intelligent Rate Limiting: Domain-specific delays and concurrency control
- CloudScraper Integration: Bypasses Cloudflare and similar protections
- Google Cache Fallback: Retrieves content from Google's cache when direct access fails
Output Options
- Plain Text: Clean, extracted article text
- Article HTML: Preserved formatting with links and media
- Full Page HTML: Complete webpage source for custom processing
- Structured JSON: All metadata in a standardized format
Use Cases
| Industry | Application |
|---|---|
| Media Monitoring | Track news coverage, brand mentions, and competitor activity |
| Research & Academia | Collect data for NLP, sentiment analysis, and content studies |
| Content Aggregation | Build news feeds, curated content platforms, and newsletters |
| SEO Analysis | Analyze competitor content, keywords, and publishing patterns |
| Market Intelligence | Monitor industry news, trends, and market developments |
| Web Archiving | Preserve article content with full metadata |
| AI/ML Training | Generate training datasets for language models |
How It Works

    graph LR
        A[Input URLs] --> B[Fetch Pages]
        B --> C{Anti-Bot Check}
        C -->|Blocked| D[Rotate Proxy/Headers]
        D --> B
        C -->|Success| E[Extract Content]
        E --> F{Extraction OK?}
        F -->|No| G[Try Fallback Engine]
        G --> E
        F -->|Yes| H[Output JSON]

- Input Processing: Accepts a list of article URLs
- Smart Fetching: Uses randomized browser headers and proxy rotation
- Anti-Bot Evasion: Detects and bypasses blocking with CloudScraper and fingerprint rotation
- Content Extraction: Applies the selected extraction engine
- Fallback Logic: Automatically tries alternative engines if extraction fails
- Output Generation: Returns structured JSON with all extracted data
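The extraction and fallback steps above can be sketched in a few lines of Python. This is a minimal illustration and not the Actor's actual code: `extract_with_fallback`, the `Extractor` alias, and the two stub engines are hypothetical names, and each engine is modeled as a plain callable that takes HTML and either returns a dict or raises.

```python
from typing import Callable

# Hypothetical model: an engine is a callable that takes HTML and returns
# a dict of extracted fields, or raises if it cannot parse the page.
Extractor = Callable[[str], dict]


def extract_with_fallback(
    html: str,
    engines: dict[str, Extractor],
    primary: str,
    use_fallback: bool = True,
) -> dict:
    """Try the primary engine first, then the remaining engines in order."""
    order = [primary]
    if use_fallback:
        order += [name for name in engines if name != primary]

    for name in order:
        try:
            result = engines[name](html)
        except Exception:
            continue  # this engine failed; move on to the next one
        if result.get("text"):  # empty text counts as a failed extraction
            output = {**result, "extractorEngine": name, "fallbackUsed": name != primary}
            if name != primary:
                output["originalExtractor"] = primary
            return output
    raise RuntimeError("no engine could extract article text")


# Stub engines standing in for real libraries, for illustration only.
def failing_engine(html: str) -> dict:
    raise ValueError("could not parse")


def simple_engine(html: str) -> dict:
    return {"title": "Example", "text": html.strip()}


article = extract_with_fallback(
    "<p>Hello</p>",
    {"newspaper4k": failing_engine, "trafilatura": simple_engine},
    primary="newspaper4k",
)
print(article["extractorEngine"], article["fallbackUsed"])
```

Treating an empty `text` field as a failure is a design choice in this sketch; it lets the loop fall through to the next engine when a parser returns without raising but finds nothing.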
Extraction Engines Comparison
| Engine | Best For | Speed | Metadata | NLP Features | Special Capabilities |
|---|---|---|---|---|---|
| Newspaper4k | General news | ⚡⚡⚡ | Full | Yes | Summary, keywords, NER |
| Trafilatura | News & blogs | ⚡⚡⚡⚡ | Full | No | Language detection, categories |
| Boilerpy3 | Simple articles | ⚡⚡⚡⚡⚡ | Basic | No | Text density metrics |
| News-Please | Rich metadata | ⚡⚡ | Full | No | Multiple fallback methods |
| Goose3 | Image extraction | ⚡⚡⚡ | Full | No | Top image detection |
| Article Parser | HTML/Markdown | ⚡⚡⚡ | Basic | No | Multiple output formats |
| JusText | Boilerplate removal | ⚡⚡⚡⚡ | Basic | No | Language-aware filtering |
Recommended Engines by Content Type
- News Sites: Newspaper4k or Trafilatura
- Blog Posts: Trafilatura or Goose3
- Long-form Articles: Newspaper4k (with NLP for summarization)
- Image-heavy Content: Goose3
- High-volume Scraping: Boilerpy3 or Trafilatura
- Non-English Content: JusText (40+ languages supported)
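If you choose the engine programmatically, the recommendations above fit in a small lookup table. The sketch below is only an illustration: the content-type keys and the `pick_engine` helper are hypothetical, and the engine identifiers are assumed to be the lowercase names used in the input examples.

```python
# Recommendations copied from the list above; the keys are hypothetical labels.
RECOMMENDED_ENGINES: dict[str, list[str]] = {
    "news": ["newspaper4k", "trafilatura"],
    "blog": ["trafilatura", "goose3"],
    "long_form": ["newspaper4k"],
    "image_heavy": ["goose3"],
    "high_volume": ["boilerpy3", "trafilatura"],
    "non_english": ["justext"],
}


def pick_engine(content_type: str, default: str = "newspaper4k") -> str:
    """Return the first recommended engine, or the default for unknown types."""
    return RECOMMENDED_ENGINES.get(content_type, [default])[0]


print(pick_engine("blog"))
```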
Quick Start
Run on Apify Platform

    {
      "startUrls": [
        "https://www.nytimes.com/2024/01/15/technology/ai-developments.html",
        "https://www.theguardian.com/world/2024/jan/15/breaking-news"
      ],
      "extractorEngine": "newspaper4k"
    }

Run Locally with Apify CLI

    # Install Apify CLI
    npm install -g apify-cli

    # Clone and run
    apify pull article-extractor-news-scraper
    cd article-extractor-news-scraper
    apify run --input='{"startUrls": ["https://example.com/article"]}'

Call via API

    curl -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~article-extractor-news-scraper/runs" \
      -H "Authorization: Bearer YOUR_API_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"startUrls": ["https://www.bbc.com/news/world-12345"], "extractorEngine": "newspaper4k"}'

Input Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | List of article URLs to extract |
| `extractorEngine` | string | newspaper4k | Extraction engine to use |
| `useFallbackExtractors` | boolean | true | Try alternative engines if primary fails |
| `saveHtml` | boolean | false | Include full page HTML in output |
| `saveArticleHtml` | boolean | false | Include cleaned article HTML |
| `maxRetries` | integer | 15 | Retry attempts for failed requests |
| `useHeaderGenerator` | boolean | true | Generate realistic browser headers |
| `headerGeneratorOptions` | object | {} | Browser/device emulation settings |
| `customHeaders` | object | {} | Additional HTTP headers |
| `proxyConfiguration` | object | residential | Proxy settings |
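When building the input in code, it can help to start from the defaults in the table and override only what you need. The following is a hedged sketch: `DEFAULT_INPUT` and `build_input` are hypothetical helpers, and the residential proxy default is assumed to correspond to the `proxyConfiguration` object used in the examples in this README.

```python
import copy
import json

# Defaults taken from the parameter table above. The proxy object is an
# assumption about how the "residential" default is represented.
DEFAULT_INPUT = {
    "extractorEngine": "newspaper4k",
    "useFallbackExtractors": True,
    "saveHtml": False,
    "saveArticleHtml": False,
    "maxRetries": 15,
    "useHeaderGenerator": True,
    "headerGeneratorOptions": {},
    "customHeaders": {},
    "proxyConfiguration": {"useApifyProxy": True, "apifyProxyGroups": ["RESIDENTIAL"]},
}


def build_input(start_urls: list[str], **overrides) -> dict:
    """Merge overrides into the defaults and reject unknown parameter names."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    unknown = set(overrides) - set(DEFAULT_INPUT)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    merged = copy.deepcopy(DEFAULT_INPUT)  # avoid sharing nested dicts
    merged.update(overrides)
    return {"startUrls": list(start_urls), **merged}


run_input = build_input(
    ["https://example.com/article"],
    extractorEngine="trafilatura",
    maxRetries=5,
)
print(json.dumps(run_input, indent=2))
```

Rejecting unknown keys catches typos such as `maxRetry` before a run is started.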
Full Input Example

    {
      "startUrls": [
        "https://www.nytimes.com/2024/01/15/world/article.html",
        "https://www.theguardian.com/world/2024/jan/15/story",
        "https://www.bbc.com/news/world-12345678"
      ],
      "extractorEngine": "newspaper4k",
      "useFallbackExtractors": true,
      "saveHtml": false,
      "saveArticleHtml": true,
      "maxRetries": 15,
      "useHeaderGenerator": true,
      "headerGeneratorOptions": {
        "browsers": ["chrome", "firefox", "safari", "edge"],
        "devices": ["desktop"]
      },
      "customHeaders": {},
      "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
      }
    }

Output Format
Each extracted article produces a JSON object with the following fields:
Common Fields (All Engines)
| Field | Type | Description |
|---|---|---|
| `url` | string | Original article URL |
| `title` | string | Article headline |
| `text` | string | Full article text (cleaned) |
| `sourceDomain` | string | Website domain |
| `extractorEngine` | string | Engine used for extraction |
| `extractedAt` | string | ISO 8601 timestamp |
Extended Fields (Engine-Dependent)
| Field | Type | Available In |
|---|---|---|
| `description` | string | newspaper4k, goose3, news-please |
| `author` | array | newspaper4k, news-please |
| `publishedDate` | string | newspaper4k, trafilatura, news-please |
| `image` | string | newspaper4k, goose3, news-please |
| `keywords` | array | newspaper4k, goose3 |
| `summary` | string | newspaper4k |
| `language` | string | newspaper4k, trafilatura, justext |
| `categories` | array | trafilatura |
| `tags` | array | trafilatura |
| `allImages` | array | newspaper4k |
| `metaData` | object | newspaper4k |
| `siteName` | string | newspaper4k |
| `favicon` | string | newspaper4k |
Metadata Fields
| Field | Type | Description |
|---|---|---|
| `fallbackUsed` | boolean | Whether a fallback engine was used |
| `originalExtractor` | string | Originally requested engine (if fallback used) |
| `fetchedFromCache` | boolean | Whether content was fetched from Google Cache |
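The metadata fields make it straightforward to check how a run went. Below is a minimal sketch of a summary over dataset items; the `summarize` helper and the two sample items are hypothetical and only follow the field names in the tables above.

```python
from collections import Counter


def summarize(items: list[dict]) -> dict:
    """Count empty extractions, fallbacks, and cache hits across items."""
    engines = Counter(item.get("extractorEngine", "unknown") for item in items)
    return {
        "total": len(items),
        "emptyText": sum(1 for item in items if not item.get("text")),
        "fallbackUsed": sum(1 for item in items if item.get("fallbackUsed")),
        "fromCache": sum(1 for item in items if item.get("fetchedFromCache")),
        "byEngine": dict(engines),
    }


# Hypothetical items shaped after the field tables, not real Actor output.
items = [
    {
        "url": "https://example.com/a",
        "title": "A",
        "text": "Body",
        "extractorEngine": "newspaper4k",
        "fallbackUsed": False,
        "fetchedFromCache": False,
    },
    {
        "url": "https://example.com/b",
        "title": "B",
        "text": "",
        "extractorEngine": "trafilatura",
        "fallbackUsed": True,
        "originalExtractor": "newspaper4k",
        "fetchedFromCache": False,
    },
]
print(summarize(items))
```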
Example Outputs
Anti-Blocking Features
This Actor includes advanced anti-blocking technology to maximize success rates:
Browser Fingerprint Generation
Uses browserforge to generate realistic browser fingerprints including:
- Chrome, Firefox, Safari, and Edge user agents
- Proper `sec-ch-ua` client hints
- Consistent platform and viewport data
- Session-based fingerprint persistence
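Session-based persistence means that a session keeps the same generated headers until it is rotated. The sketch below shows that idea with the standard library only; `SessionHeaders` and `fake_generator` are hypothetical, and the placeholder generator stands in for a real header generator such as browserforge.

```python
from itertools import count
from typing import Callable


class SessionHeaders:
    """Cache one generated header set per session ID."""

    def __init__(self, generate: Callable[[], dict[str, str]]) -> None:
        self._generate = generate
        self._by_session: dict[str, dict[str, str]] = {}

    def get(self, session_id: str) -> dict[str, str]:
        # Generate once, then reuse, so a session keeps a consistent identity.
        if session_id not in self._by_session:
            self._by_session[session_id] = self._generate()
        return dict(self._by_session[session_id])

    def rotate(self, session_id: str) -> dict[str, str]:
        # Drop the cached headers, for example after the session is blocked.
        self._by_session.pop(session_id, None)
        return self.get(session_id)


# Placeholder generator; real code would call a header generator library here.
_serial = count(1)


def fake_generator() -> dict[str, str]:
    return {"User-Agent": f"placeholder-agent/{next(_serial)}"}


sessions = SessionHeaders(fake_generator)
first = sessions.get("session-a")
print(first == sessions.get("session-a"))     # same session, same headers
print(first == sessions.rotate("session-a"))  # rotated, so headers differ
```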
Proxy Rotation
- Automatic proxy rotation on 403/429 errors
- Support for residential, datacenter, and custom proxies
- Domain-specific proxy strategies
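Rotation on 403/429 can be pictured as a retry loop that switches proxy whenever a response looks blocked. This is a sketch under stated assumptions rather than the Actor's implementation: `fetch_with_rotation` is hypothetical, the `fetch` callable is injected so no network access is needed, and `max_retries` mirrors the `maxRetries` input parameter.

```python
from itertools import cycle
from typing import Callable

BLOCKED_STATUSES = {403, 429}

# Hypothetical fetch signature: (url, proxy) -> (status_code, body)
Fetch = Callable[[str, str], tuple[int, str]]


def fetch_with_rotation(
    url: str,
    proxies: list[str],
    fetch: Fetch,
    max_retries: int = 15,
) -> tuple[int, str, str]:
    """Retry a request, moving to the next proxy after each 403 or 429."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)
    proxy = next(pool)
    for _ in range(max_retries + 1):  # first attempt plus the retries
        status, body = fetch(url, proxy)
        if status not in BLOCKED_STATUSES:
            return status, body, proxy
        proxy = next(pool)  # blocked: rotate before the next attempt
    raise RuntimeError(f"still blocked after {max_retries} retries: {url}")


# Stub fetcher for illustration: the first proxy is always blocked.
def fake_fetch(url: str, proxy: str) -> tuple[int, str]:
    return (403, "") if proxy == "proxy-1" else (200, "<html>ok</html>")


status, body, used = fetch_with_rotation(
    "https://example.com/article", ["proxy-1", "proxy-2"], fake_fetch
)
print(status, used)
```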
Intelligent Rate Limiting
- Per-domain concurrency control
- Adaptive delays based on site response
- Strict mode for heavily protected sites
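One way to picture adaptive per-domain delays is a small tracker that doubles a domain's delay after a blocked response and relaxes it after a success. The class below is a hypothetical sketch, not the Actor's code: the name `DomainRateLimiter`, the doubling and halving policy, and the default delays are all assumptions, and concurrency control is left out.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Track a per-domain delay that grows on 403/429 and shrinks on success."""

    def __init__(self, base_delay: float = 1.0, max_delay: float = 30.0, clock=time.monotonic):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._clock = clock
        self._delay: dict[str, float] = {}
        self._next_allowed: dict[str, float] = {}

    def wait_time(self, url: str) -> float:
        """Seconds to wait before the next request to this URL's domain."""
        domain = urlparse(url).netloc
        return max(0.0, self._next_allowed.get(domain, 0.0) - self._clock())

    def record(self, url: str, status: int) -> None:
        """Update the domain's delay based on the response status."""
        domain = urlparse(url).netloc
        delay = self._delay.get(domain, self.base_delay)
        if status in (403, 429):
            delay = min(delay * 2, self.max_delay)   # back off when blocked
        else:
            delay = max(self.base_delay, delay / 2)  # recover after success
        self._delay[domain] = delay
        self._next_allowed[domain] = self._clock() + delay


limiter = DomainRateLimiter(base_delay=1.0)
limiter.record("https://example.com/a", 429)
print(round(limiter.wait_time("https://example.com/b"), 1))
print(limiter.wait_time("https://other.example.org/"))
```

A caller would sleep for `wait_time(url)` before each request and call `record` with the response status afterwards. The clock is injectable so the behavior can be tested without real waiting.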
CloudScraper Integration
- Bypasses Cloudflare browser verification
- Handles JavaScript challenges
- Automatic cookie management
Google Cache Fallback
When direct access fails after all retries, the Actor attempts to retrieve content from Google's cache as a last resort.
Performance Tips
For Maximum Speed

    {
      "extractorEngine": "boilerpy3",
      "maxRetries": 5,
      "useFallbackExtractors": false
    }

For Maximum Success Rate

    {
      "extractorEngine": "newspaper4k",
      "maxRetries": 15,
      "useFallbackExtractors": true,
      "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
      }
    }

For Rich Metadata

    {
      "extractorEngine": "newspaper4k",
      "saveArticleHtml": true,
      "useFallbackExtractors": true
    }

Integrations
Python

    from apify_client import ApifyClient

    # Authenticate with your Apify API token
    client = ApifyClient("YOUR_API_TOKEN")

    run_input = {
        "startUrls": ["https://www.example.com/article"],
        "extractorEngine": "newspaper4k",
    }

    # Start the Actor and wait for the run to finish
    run = client.actor("YOUR_USERNAME/article-extractor-news-scraper").call(run_input=run_input)

    # Read the extracted articles from the run's default dataset
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(f"Title: {item['title']}")
        print(f"Text: {item['text'][:200]}...")

JavaScript/Node.js

    import { ApifyClient } from 'apify-client';

    const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

    // Start the Actor and wait for the run to finish
    const run = await client.actor('YOUR_USERNAME/article-extractor-news-scraper').call({
        startUrls: ['https://www.example.com/article'],
        extractorEngine: 'newspaper4k',
    });

    // Read the extracted articles from the run's default dataset
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.log(`Title: ${item.title}`);
        console.log(`Text: ${item.text.substring(0, 200)}...`);
    });

Webhooks
Configure webhooks to receive results automatically:

    {
      "webhooks": [
        {
          "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
          "requestUrl": "https://your-server.com/webhook"
        }
      ]
    }

Zapier / Make (Integromat)
Use the Apify integration in Zapier or Make to connect extracted articles to:
- Google Sheets
- Notion databases
- Slack notifications
- Email newsletters
- CRM systems
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| Empty text output | Anti-bot blocking | Enable residential proxies, reduce concurrency |
| 403/429 errors | Rate limiting | Increase maxRetries |
| Timeout errors | Slow server response | Increase timeout, try Google Cache |
| Missing metadata | Engine limitation | Switch to a different extraction engine |
| Garbled text | Encoding issues | Try trafilatura or newspaper4k |
Reporting Issues
If you encounter persistent issues:
- Check if the URL works in a regular browser
- Try different extraction engines
- Open an issue with:
  - The problematic URL
  - Your input configuration
Frequently Asked Questions
Changelog
v1.0.0 (December 2025)
- Initial public release
- 7 extraction engines: Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, JusText
- Advanced anti-blocking with browserforge fingerprinting
- Automatic fallback extraction
- Google Cache fallback for blocked pages
- Multiple dataset views (Overview, Content, Metadata)
- Configurable concurrency and retry settings
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Built with ❤️ for the data extraction community
Keywords: article extractor, news scraper, web scraping, content extraction, newspaper4k, trafilatura, apify actor, python scraper, text extraction, metadata extraction, NLP, news monitoring, content aggregation
