URL: https://apify.com/codingfrontend/article-content-extractor

Article Content Extractor · Apify


Pricing

from $4.99 / 1,000 results

Article Content Extractor

Extract clean article content, metadata and structured information from any web page. Returns title, description, author, publish date, plain content, word count, images, and more.

Rating

0.0

(0)

Developer

๐Ÿ‘ Coding Frontned

Coding Frontned

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 13 hours ago

Extract clean article content, metadata, images, and structured data from any web page URL. Provides title, description, author, publish date, plain text content, word count, reading time, images, links, and JSON-LD structured data.

Features

  • Extracts article body using Readability-like algorithm (tries <article>, [itemprop="articleBody"], .post-content, main, etc.)
  • Parses all meta tags: Open Graph, Twitter Cards, standard HTML meta
  • Extracts JSON-LD Schema.org structured data (Article, NewsArticle, BlogPosting, etc.)
  • Detects author, publish date, modified date, section, tags, and language
  • Collects image list with dimensions (up to 50 images)
  • Optionally gathers internal & external links
  • Calculates word count and estimated reading time (200 WPM)
  • Works with news sites, blogs, Wikipedia, Medium, and most static/SSR pages
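
As a rough illustration of the JSON-LD feature listed above, the sketch below collects every `<script type="application/ld+json">` block from a page using only the Python standard library. It is not the Actor's actual implementation; the names `JsonLdParser` and `extract_json_ld` are hypothetical, and skipping malformed blocks silently is an assumption.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # assumption: skip malformed blocks instead of failing


def extract_json_ld(html):
    """Return a list with one parsed object per valid JSON-LD block."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

A page can carry several JSON-LD blocks, which is why the sketch returns a list; how the Actor chooses what ends up in `schemaData` is not documented here.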

Input

| Field | Type | Default | Description |
|---|---|---|---|
| urls | array | required | Article URLs to extract (one per item) |
| includeImages | boolean | true | Include list of images found in the article |
| includeLinks | boolean | false | Include internal and external hyperlinks |
| includeHtml | boolean | false | Include cleaned article HTML in addition to plain text |
| extractSchema | boolean | true | Parse JSON-LD structured data embedded in the page |
| proxyConfiguration | object | - | Proxy settings (residential recommended for paywalled sites) |

Example Input

{
  "urls": [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://www.bbc.com/news"
  ],
  "includeImages": true,
  "includeLinks": true,
  "includeHtml": false,
  "extractSchema": true
}
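
For callers who assemble the input programmatically, a minimal sketch of a helper that applies the defaults from the Input table might look like this. The function name `build_run_input` and its URL check are hypothetical and are not part of the Actor.

```python
def build_run_input(urls, include_images=True, include_links=False,
                    include_html=False, extract_schema=True,
                    proxy_configuration=None):
    """Build an input object using the defaults from the Input table."""
    if not urls:
        raise ValueError("urls is required and must contain at least one URL")
    for url in urls:
        # Assumption: only absolute http(s) URLs are useful as input.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    run_input = {
        "urls": list(urls),
        "includeImages": include_images,
        "includeLinks": include_links,
        "includeHtml": include_html,
        "extractSchema": extract_schema,
    }
    # proxyConfiguration has no default, so it is only set when provided.
    if proxy_configuration is not None:
        run_input["proxyConfiguration"] = proxy_configuration
    return run_input
```

With the Apify API client for Python, the resulting dictionary could be passed as `ApifyClient(token).actor("codingfrontend/article-content-extractor").call(run_input=run_input)`; the Actor ID here is taken from the page URL and should be checked against your account.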

Output

Each URL produces one dataset item:

{
  "position": 1,
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "title": "Artificial intelligence - Wikipedia",
  "description": "Artificial intelligence (AI) is the intelligence of machines...",
  "author": null,
  "publishDate": null,
  "modifiedDate": null,
  "section": null,
  "tags": ["technology", "science"],
  "language": "en",
  "siteName": null,
  "canonical": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "domain": "en.wikipedia.org",
  "content": "From Wikipedia, the free encyclopedia...",
  "contentHtml": null,
  "wordCount": 30000,
  "readingTimeMinutes": 150,
  "ogImage": null,
  "images": [
    {"src": "https://upload.wikimedia.org/...", "alt": "AI illustration", "width": 350, "height": 230}
  ],
  "internalLinks": [{"text": "Machine learning", "href": "https://en.wikipedia.org/wiki/Machine_learning"}],
  "externalLinks": [{"text": "Nature paper", "href": "https://www.nature.com/..."}],
  "schemaData": {"@type": "Article", "name": "Artificial intelligence"},
  "favicon": "https://en.wikipedia.org/favicon.ico",
  "scrapedAt": "2025-01-01T00:00:00.000Z"
}
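
The `wordCount` and `readingTimeMinutes` values in the example are consistent with the stated 200 WPM rate: 30,000 words / 200 = 150 minutes. A minimal sketch of that arithmetic follows. Whitespace-based word splitting and rounding up are assumptions, since the listing does not say how the Actor tokenizes or rounds.

```python
import math

WORDS_PER_MINUTE = 200  # rate stated in the Features list


def count_words(text):
    """Count whitespace-separated tokens (assumed tokenization)."""
    return len(text.split())


def reading_time_minutes(word_count, wpm=WORDS_PER_MINUTE):
    """Estimated reading time in minutes, rounded up (assumed rounding)."""
    return math.ceil(word_count / wpm)


print(reading_time_minutes(30000))  # 150, matching the example item
```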

Dataset Views

| View | Fields |
|---|---|
| Overview | position, title, author, publishDate, domain, language, wordCount, readingTimeMinutes, url |
| Content | title, description, author, publishDate, modifiedDate, section, tags, content, wordCount, url |
| Media | title, url, images, internalLinks, externalLinks |
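
When working with downloaded items outside the Apify Console, the same views can be approximated by projecting each item onto the listed fields. This is a sketch only; the `VIEWS` mapping and the `project` helper are hypothetical names, and filling missing fields with `None` is an assumption.

```python
# Field lists copied from the Dataset Views table.
VIEWS = {
    "overview": ["position", "title", "author", "publishDate", "domain",
                 "language", "wordCount", "readingTimeMinutes", "url"],
    "content": ["title", "description", "author", "publishDate",
                "modifiedDate", "section", "tags", "content", "wordCount",
                "url"],
    "media": ["title", "url", "images", "internalLinks", "externalLinks"],
}


def project(item, view):
    """Keep only the fields of the given view; missing fields become None."""
    return {field: item.get(field) for field in VIEWS[view]}
```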

Known Limitations

  • Paywalled / login-required sites: The Actor extracts whatever is publicly visible. Pages behind auth walls may return empty content.
  • Heavy JavaScript SPAs: Content rendered by client-side JavaScript (React, Vue) may not be fully extracted. The Actor waits up to 15 seconds for content to appear before extracting.
  • Author field on Wikipedia: On Wikipedia pages, the Actor may report text from the authority control section as the "author", because Wikipedia provides no standard author meta tag. This is a limitation of relying solely on meta tags.
  • Cloudflare / bot-protected sites: Sites protected by Cloudflare Managed Challenge, DataDome, or PerimeterX can return empty or error results. Use residential proxies to improve the success rate. See ../.github/instructions/anti-bot-bypassing.instructions.md.

Proxy

Residential proxies are recommended for news sites and paywalled content:

{
"proxyConfiguration":{
"useApifyProxy":true,
"apifyProxyGroups":["RESIDENTIAL"]
}
}
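
One way to apply this in practice is to run without a proxy first and retry only the URLs that came back empty, this time with the residential configuration above. The sketch below assumes that failed extractions appear as dataset items whose `content` is empty or null; the shape of error items is not documented here, and the helper names are hypothetical.

```python
import copy

# Proxy settings copied from the example above.
RESIDENTIAL_PROXY = {
    "useApifyProxy": True,
    "apifyProxyGroups": ["RESIDENTIAL"],
}


def urls_needing_retry(items):
    """URLs whose item has empty or null content (assumed failure signal)."""
    return [item["url"] for item in items
            if not (item.get("content") or "").strip()]


def build_retry_input(original_input, items):
    """Return a new input for the failed URLs, or None if nothing failed."""
    retry_urls = urls_needing_retry(items)
    if not retry_urls:
        return None
    retry_input = dict(original_input)
    retry_input["urls"] = retry_urls
    retry_input["proxyConfiguration"] = copy.deepcopy(RESIDENTIAL_PROXY)
    return retry_input
```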
