CNN Article Scraper

Pricing

Pay per usage

Extract CNN articles by category or search query with date filtering. Scrape news from politics, business, world, tech, sports, and more. Get structured data: title, author, publication date, full content. Perfect for media monitoring, research, and content analysis.

Rating: 5.0 (2)

Developer: Filip Cicvárek

Maintained by Community

Last modified: 7 months ago

What does CNN Article Scraper do?

This Actor retrieves articles from CNN.com based on your specified criteria:

Category-based scraping: Extract articles from specific CNN sections (politics, business, world news, etc.)
Search-based scraping: Find articles matching specific keywords or topics
Date filtering: Precisely control the publication time window
Concurrent processing: Adjust scraping speed with configurable concurrency
Structured output: Get clean, organized article data including title, author, publication date, full content, and URL

Use Cases

Media Monitoring: Track CNN's coverage of specific topics or events over time and identify trends in news reporting
Market Research: Analyze business and technology news for competitive intelligence, industry trends, and market insights
Academic Research: Collect news articles for content analysis, sentiment studies, or media studies research projects
Content Aggregation: Build news feeds or newsletters by automatically collecting relevant CNN articles within specific timeframes
Competitive Analysis: Track how CNN covers your industry, competitors, or specific topics compared to other news sources

Input Highlights

category / searchQuery: Provide at least one. If you supply both, only articles that match both the category and the search query are returned.
categoryMode: 'latest' (fast landing page scan), 'archive' (monthly sitemap crawl for deep history), or 'auto' (default heuristic based on your date window).
archiveMonthLimit: Caps how many monthly sitemap files are loaded when archive mode runs. Increase for longer ranges, but expect slower runs.
maxArticles: Set to 0 for no limit; the Actor keeps going until it exhausts collected links.
concurrency: Controls how many article detail pages are fetched in parallel.
Oldest-first processing: Sitemap discoveries are sorted chronologically, guaranteeing the Actor starts with the oldest articles inside your window.
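
The input rules above can be checked on the client side before a run is started. The following is a minimal Python sketch, not the Actor's own validation: the function name validate_input, the lower bound on concurrency, and the defaults used when maxArticles or concurrency are missing are assumptions made for illustration.

```python
from typing import Any

ALLOWED_CATEGORY_MODES = {"latest", "archive", "auto"}


def validate_input(run_input: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a run input; an empty list means none."""
    problems: list[str] = []

    # At least one of category / searchQuery must be provided.
    if not run_input.get("category") and not run_input.get("searchQuery"):
        problems.append("provide at least one of 'category' or 'searchQuery'")

    # categoryMode defaults to 'auto' when omitted.
    mode = run_input.get("categoryMode", "auto")
    if mode not in ALLOWED_CATEGORY_MODES:
        problems.append(f"categoryMode must be one of {sorted(ALLOWED_CATEGORY_MODES)}")

    # maxArticles: 0 means no limit, so only negative values are rejected.
    # (Treating a missing value as 0 is an assumption of this sketch.)
    if run_input.get("maxArticles", 0) < 0:
        problems.append("maxArticles must be 0 (no limit) or a positive integer")

    # Assumed lower bound: at least one article page fetched at a time.
    if run_input.get("concurrency", 1) < 1:
        problems.append("concurrency must be at least 1")

    return problems
```

An empty result means the input passes these basic checks; it does not guarantee that the Actor will find articles for the chosen category or date window.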

Example Input

{
  "category": "world",
  "startDate": "2025-03-01",
  "endDate": "2025-10-10",
  "maxArticles": 100,
  "concurrency": 5,
  "categoryMode": "auto",
  "archiveMonthLimit": 12
}

Output Format

Each scraped article is stored as a separate item in the dataset with the following structure:

{
  "title": "Article headline",
  "author": "Reporter Name",
  "publicationDate": "2025-01-15",
  "updatedDate": "2025-01-20",
  "content": "Full article text content...",
  "url": "https://www.cnn.com/2025/01/15/politics/article-slug/index.html",
  "scrapedAt": "2025-10-10T14:30:00.000Z"
}

Output Fields

title: Article headline as it appears on CNN
author: Name(s) of the article's author(s), or "Unknown" if not found
publicationDate: Publication date in YYYY-MM-DD format
updatedDate: Last updated date in YYYY-MM-DD format when available
content: Full article text with paragraphs separated by double line breaks
url: Direct link to the article on CNN.com
scrapedAt: ISO timestamp of when the article was scraped
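
When consuming the dataset, it can help to convert each item into typed values. The sketch below is one way to do that in Python, assuming that "double line breaks" means the two-character sequence "\n\n" and that updatedDate may be absent or empty. The Article class and parse_article function are hypothetical names, not part of the Actor.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Any, Optional


@dataclass
class Article:
    title: str
    author: str
    publication_date: date
    updated_date: Optional[date]
    paragraphs: list[str]
    url: str
    scraped_at: datetime


def parse_article(item: dict[str, Any]) -> Article:
    """Convert one dataset item (a dict) into an Article with typed fields."""
    updated = item.get("updatedDate")
    return Article(
        title=item["title"],
        author=item["author"],
        publication_date=date.fromisoformat(item["publicationDate"]),
        # updatedDate is only present "when available".
        updated_date=date.fromisoformat(updated) if updated else None,
        # Assumption: paragraphs are separated by "\n\n".
        paragraphs=[p.strip() for p in item["content"].split("\n\n") if p.strip()],
        url=item["url"],
        # Replace the trailing "Z" so older Python versions can parse the timestamp.
        scraped_at=datetime.fromisoformat(item["scrapedAt"].replace("Z", "+00:00")),
    )
```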

Features

✅ Dual scraping modes: Category browsing or keyword search
✅ Archive-aware category discovery: Navigates CNN’s live article sitemaps (with RSS fallback when sitemaps are unavailable)
✅ Precise date filtering: Only scrapes articles within your specified date range
✅ Early filtering optimization: Filters articles by date before scraping full content
✅ Automatic retry logic: Handles temporary network errors with built-in retry mechanism
✅ Concurrent processing: Adjustable parallelization for faster scraping
✅ Clean content extraction: Filters out ads, JavaScript code, and non-article content
✅ Structured data output: Consistent JSON format for easy integration
✅ Duplicate prevention: Automatically removes duplicate article URLs
✅ Pay-per-use pricing: Only pay for what you scrape
✅ Chronological batching: Prioritizes the oldest articles inside your date window so you see early coverage first
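
The early date filtering, duplicate prevention, and oldest-first ordering listed above can be illustrated with a short Python sketch. It is not the Actor's implementation, which is not documented here. It assumes article URLs carry their publication date in the path (/YYYY/MM/DD/), as in the example output URL, and the function names are hypothetical.

```python
import re
from datetime import date
from typing import Optional

# Assumed URL shape, taken from the example output: /YYYY/MM/DD/section/slug/
URL_DATE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def date_from_url(url: str) -> Optional[date]:
    """Return the date embedded in the URL path, or None if there is none."""
    match = URL_DATE.search(url)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # digits that do not form a real calendar date
        return None


def select_urls(urls: list[str], start: date, end: date) -> list[str]:
    """Drop duplicates and out-of-window URLs, then sort oldest first."""
    dated: dict[str, date] = {}
    for url in urls:
        published = date_from_url(url)
        if published is not None and start <= published <= end:
            dated.setdefault(url, published)  # a repeated URL is kept once
    return sorted(dated, key=lambda u: (dated[u], u))
```

Filtering on the URL before requesting the page is what saves work: an article outside the window never costs a detail-page fetch.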

Performance & Limits

Speed Optimization

Concurrency: Higher concurrency speeds up scraping but uses more resources
Date filtering: Early date filtering reduces unnecessary requests
Batch processing: Articles are processed in batches based on concurrency setting
Archive mode: Sitemap downloads add latency, so reduce the date range or archiveMonthLimit when you only need recent content. When CNN blocks sitemap access, the Actor falls back to RSS feeds, and coverage may be narrower. Sitemaps supply hundreds of URLs per month, so consider lowering maxArticles if you only need a subset.
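
To make the relationship between concurrency and batches concrete, here is a minimal Python sketch of batch processing with asyncio. It only illustrates the general technique: fetch_in_batches and the fetch callable are hypothetical, and the Actor's internal scheduling may differ.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_in_batches(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int,
) -> list[str]:
    """Fetch URLs in consecutive batches of at most `concurrency` items."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")

    results: list[str] = []
    for i in range(0, len(urls), concurrency):
        batch = urls[i : i + concurrency]
        # All requests in a batch run in parallel; the next batch starts
        # only after every request in the current one has finished.
        results.extend(await asyncio.gather(*(fetch(u) for u in batch)))
    return results
```

With this scheme a single slow page holds up its whole batch, which is one reason raising concurrency does not always speed up a run proportionally.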

Recommended Settings

For quick tests: maxArticles: 10, concurrency: 1
For moderate scraping: maxArticles: 100, concurrency: 5
For large-scale scraping: maxArticles: 0 (unlimited), concurrency: 10-15
For historical digging: categoryMode: "archive", widen archiveMonthLimit to cover every month in your range, and be prepared for longer runtimes
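
To pick an archiveMonthLimit that covers a date range, count the calendar months the range touches. The Python sketch below does this under the assumption that one sitemap file corresponds to one calendar month; months_spanned is a hypothetical helper name.

```python
from datetime import date


def months_spanned(start: date, end: date) -> int:
    """Count the calendar months touched by the inclusive range start..end."""
    if end < start:
        raise ValueError("end must not be before start")
    return (end.year - start.year) * 12 + (end.month - start.month) + 1
```

For the example input (2025-03-01 to 2025-10-10) this gives 8, so the archiveMonthLimit of 12 used there leaves some headroom.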

Troubleshooting

No articles found

Problem: Actor completes but returns zero articles.

Solutions:

For older timeframes, switch categoryMode to "archive" (or increase archiveMonthLimit) so the Actor scans the CNN sitemap.
Verify that your date range actually contains published articles; try widening the window temporarily.
Check whether the category URL structure has changed.
Try using searchQuery instead of category for more reliable results.

Missing author or content

Problem: Some fields return "Unknown" or empty content.

Solutions:

CNN's HTML structure varies by article type. Some articles (videos, opinion pieces) may have different layouts.
The Actor tries multiple selectors to extract data but cannot guarantee that every field is found for every article type.
Consider filtering out items with missing fields in your post-processing.
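
The post-processing check suggested above could look like the following Python sketch. It treats an item as incomplete when its content is empty or its author is "Unknown" or empty; the function names are hypothetical, and you may want a looser rule if missing authors are acceptable for your use case.

```python
from typing import Any


def is_complete(item: dict[str, Any]) -> bool:
    """True when the item has non-empty content and a known author."""
    author = (item.get("author") or "").strip()
    content = (item.get("content") or "").strip()
    return bool(content) and author not in ("", "Unknown")


def split_by_completeness(
    items: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (complete, incomplete) so incomplete items can be reviewed."""
    complete = [item for item in items if is_complete(item)]
    incomplete = [item for item in items if not is_complete(item)]
    return complete, incomplete
```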

Scraping too slow

Problem: Actor takes too long to complete.

Solutions:

Increase concurrency to 10-15 for faster parallel processing
Reduce maxArticles if you don't need all available articles
Narrow your date range to reduce the number of articles to process

Limitations

The Actor scrapes publicly available CNN articles only
Article structure may vary, affecting data extraction accuracy
Very old articles may have different HTML structures
Category archive filtering uses URL keywords; niche sub-sections may require a search query for full coverage
Sitemap responses can list hundreds of URLs for a single month; the Actor trims to the oldest archiveMonthLimit months to control runtime
CNN occasionally throttles or withholds sitemap data; in those cases the RSS fallback only exposes the stories the feeds provide
CNN may update its website structure, requiring Actor maintenance
Search API results are limited to what CNN makes available through their search service

Support

Need help or have questions about this Actor?

Open an issue in the Actor's Issues tab
Check the Apify documentation for general platform guidance
Review this README for configuration and troubleshooting tips

Feedback

If you found this Actor helpful, please leave a review on the Actor page. Your feedback helps improve the Actor and helps other users discover it.

Pricing: This Actor uses pay-per-use pricing. You only pay for the compute resources consumed during scraping. See the Apify pricing page for current rates.

URL: https://apify.com/filip_cicvarek/cnn-article-scraper