Article Content Extractor

Pricing

from $4.99 / 1,000 results

Extract clean article content, metadata and structured information from any web page. Returns title, description, author, publish date, plain content, word count, images, and more.

Developer

Coding Frontned

Maintained by Community

Last modified

13 hours ago

Features

Extracts the article body using a Readability-like algorithm (tries <article>, [itemprop="articleBody"], .post-content, main, etc.)
Parses all meta tags: Open Graph, Twitter Cards, standard HTML meta
Extracts JSON-LD Schema.org structured data (Article, NewsArticle, BlogPosting, etc.)
Detects author, publish date, modified date, section, tags, and language
Collects image list with dimensions (up to 50 images)
Optionally gathers internal & external links
Calculates word count and estimated reading time (200 WPM)
Works with news sites, blogs, Wikipedia, Medium, and most static/SSR pages
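
The JSON-LD step listed above can be illustrated with a minimal sketch. It uses only the Python standard library and is not the actor's actual implementation; the names `JsonLdParser`, `extract_json_ld`, `first_article`, and `ARTICLE_TYPES` are hypothetical and chosen for illustration.

```python
import json
from html.parser import HTMLParser

# Schema.org types named in the Features list (the "etc." is not covered here).
ARTICLE_TYPES = {"Article", "NewsArticle", "BlogPosting"}


class JsonLdParser(HTMLParser):
    """Collects parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_json_ld(html):
    """Return every JSON-LD block on the page that parses as valid JSON."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def first_article(blocks):
    """Return the first block whose @type is an article-like type, else None."""
    for block in blocks:
        if not isinstance(block, dict):
            continue
        types = block.get("@type")
        types = types if isinstance(types, list) else [types]
        if any(t in ARTICLE_TYPES for t in types):
            return block
    return None
```

Malformed JSON-LD is fairly common on real pages, which is why the sketch skips a bad block rather than raising.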

Input

Field	Type	Default	Description
`urls`	array	required	Article URLs to extract (one per item)
`includeImages`	boolean	`true`	Include list of images found in the article
`includeLinks`	boolean	`false`	Include internal and external hyperlinks
`includeHtml`	boolean	`false`	Include cleaned article HTML in addition to plain text
`extractSchema`	boolean	`true`	Parse JSON-LD structured data embedded in the page
`proxyConfiguration`	object	—	Proxy settings (residential recommended for paywalled sites)

Example Input

{
  "urls": [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://www.bbc.com/news"
  ],
  "includeImages": true,
  "includeLinks": true,
  "includeHtml": false,
  "extractSchema": true
}
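
If the input is assembled programmatically, a small helper can apply the defaults from the Input table and catch obvious mistakes before a run is started. The following is a sketch under assumptions: `build_run_input` and `DEFAULTS` are hypothetical names, and the URL check is an illustration rather than validation the actor is known to perform.

```python
import json
from urllib.parse import urlparse

# Defaults copied from the Input table above.
DEFAULTS = {
    "includeImages": True,
    "includeLinks": False,
    "includeHtml": False,
    "extractSchema": True,
}


def build_run_input(urls, **overrides):
    """Build an input object: documented defaults plus caller overrides."""
    if not urls:
        raise ValueError("urls is required and must contain at least one URL")
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    # Only fields listed in the Input table are accepted.
    unknown = set(overrides) - set(DEFAULTS) - {"proxyConfiguration"}
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    return {"urls": list(urls), **DEFAULTS, **overrides}


if __name__ == "__main__":
    run_input = build_run_input(
        ["https://en.wikipedia.org/wiki/Artificial_intelligence"],
        includeLinks=True,
    )
    print(json.dumps(run_input, indent=2))
```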

Output

Each URL produces one dataset item:

{
  "position": 1,
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "title": "Artificial intelligence - Wikipedia",
  "description": "Artificial intelligence (AI) is the intelligence of machines...",
  "author": null,
  "publishDate": null,
  "modifiedDate": null,
  "section": null,
  "tags": ["technology", "science"],
  "language": "en",
  "siteName": null,
  "canonical": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "domain": "en.wikipedia.org",
  "content": "From Wikipedia, the free encyclopedia...",
  "contentHtml": null,
  "wordCount": 30000,
  "readingTimeMinutes": 150,
  "ogImage": null,
  "images": [
    {"src": "https://upload.wikimedia.org/...", "alt": "AI illustration", "width": 350, "height": 230}
  ],
  "internalLinks": [{"text": "Machine learning", "href": "https://en.wikipedia.org/wiki/Machine_learning"}],
  "externalLinks": [{"text": "Nature paper", "href": "https://www.nature.com/..."}],
  "schemaData": {"@type": "Article", "name": "Artificial intelligence"},
  "favicon": "https://en.wikipedia.org/favicon.ico",
  "scrapedAt": "2025-01-01T00:00:00.000Z"
}
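
The `wordCount` and `readingTimeMinutes` values in this sample are consistent with the 200 WPM rate from the Features list: 30000 / 200 = 150. A minimal sketch of that arithmetic follows. Whitespace tokenization and rounding up are assumptions; the source states only the rate, not how words are split or how partial minutes are rounded.

```python
import math

WORDS_PER_MINUTE = 200  # rate stated in the Features list


def count_words(text):
    """Count whitespace-separated tokens (assumed tokenization)."""
    return len(text.split())


def reading_time_minutes(word_count, wpm=WORDS_PER_MINUTE):
    """Estimate reading time, rounding partial minutes up (assumed rounding)."""
    return math.ceil(word_count / wpm)


if __name__ == "__main__":
    print(reading_time_minutes(30000))
```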

Dataset Views

View	Fields
Overview	position, title, author, publishDate, domain, language, wordCount, readingTimeMinutes, url
Content	title, description, author, publishDate, modifiedDate, section, tags, content, wordCount, url
Media	title, url, images, internalLinks, externalLinks
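
When items are processed outside the platform, the same views can be reproduced by projecting each item onto a view's field list. This is a sketch, with `VIEWS` and `project` as hypothetical names; fields missing from an item are returned as `None`.

```python
# Field lists copied from the Dataset Views table above.
VIEWS = {
    "overview": ["position", "title", "author", "publishDate", "domain",
                 "language", "wordCount", "readingTimeMinutes", "url"],
    "content": ["title", "description", "author", "publishDate", "modifiedDate",
                "section", "tags", "content", "wordCount", "url"],
    "media": ["title", "url", "images", "internalLinks", "externalLinks"],
}


def project(item, view):
    """Return only the fields of `item` that belong to the named view."""
    return {field: item.get(field) for field in VIEWS[view]}


if __name__ == "__main__":
    sample = {"position": 1, "title": "Artificial intelligence - Wikipedia",
              "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
              "wordCount": 30000, "readingTimeMinutes": 150}
    print(project(sample, "overview"))
```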

Known Limitations

Paywalled / login-required sites: The actor extracts only what is publicly visible. Pages behind auth walls may return empty content.
Heavy JavaScript SPAs: Content rendered by client-side JavaScript (React, Vue) may not be fully extracted. The actor waits up to 15 seconds for content to appear before extracting.
Author field on Wikipedia: Wikipedia has no standard author meta tag, so the actor may report the authority control section as the "author". This is a limitation of relying solely on meta tags.
Cloudflare / bot-protected sites: Sites protected by Cloudflare Managed Challenge, DataDome, or PerimeterX will return empty or error results. Use residential proxies to improve the success rate. See ../.github/instructions/anti-bot-bypassing.instructions.md.

Proxy

Residential proxies are recommended for news sites and paywalled content:

{
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
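
One way to combine this with the limitations above is to re-run only the URLs whose content came back empty, this time with the residential proxy configuration. The sketch below assumes a blocked page yields an item with a null or blank `content` field; the shape of error results is not documented here, so they are not handled. `build_retry_input` is a hypothetical helper.

```python
# Proxy settings copied from the example above.
RESIDENTIAL_PROXY = {"useApifyProxy": True, "apifyProxyGroups": ["RESIDENTIAL"]}


def build_retry_input(items, base_input):
    """Return an input for a second run covering empty-content URLs, or None."""
    failed = [
        item["url"]
        for item in items
        if not (item.get("content") or "").strip()  # assumed signal of a blocked page
    ]
    if not failed:
        return None
    return {**base_input, "urls": failed, "proxyConfiguration": RESIDENTIAL_PROXY}


if __name__ == "__main__":
    items = [
        {"url": "https://example.com/a", "content": "Some article text"},
        {"url": "https://example.com/b", "content": None},
    ]
    print(build_retry_input(items, {"includeImages": True}))
```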

URL: https://apify.com/codingfrontend/article-content-extractor
