Webpage Text Extractor

Developer

Stas Persiianenko

Maintained by Community

Last modified

3 months ago

What does Webpage Text Extractor do?

This actor fetches web pages and extracts their clean text content by stripping all HTML tags, scripts, and styles. It identifies the main content area (article, main, etc.) and extracts the heading structure, page links, and metadata such as author, publish date, and language. Use it for LLM input preparation, content analysis, text mining, or feeding clean text into downstream data pipelines.

Use cases

AI/LLM engineers -- convert web pages to clean text for RAG pipelines, fine-tuning datasets, or prompt context
Content analysts -- extract text for sentiment analysis, topic modeling, keyword extraction, or NLP processing
Data journalists -- collect article text from multiple news sources for comparison and analysis
Accessibility auditors -- extract text structure and heading hierarchy to verify correct semantic markup
Data pipeline builders -- feed clean, structured text into downstream processing tools and databases

Why use Webpage Text Extractor?

AI-ready clean text -- strips all HTML, scripts, styles, and ads to return structured output ready for LLM training, RAG pipelines, and AI agent workflows
Rich metadata -- extracts title, meta description, author, publish date, language, and Open Graph tags
Heading structure -- returns all headings with their level (H1-H6) for document outline analysis
Link extraction -- captures all links with text, href, and internal/external classification
Configurable metadata -- toggle metadata inclusion with the includeMetadata option to control output size
Pay-per-event pricing -- costs $0.001 per URL plus a $0.035 start fee per run, with no monthly subscription

Input parameters

Parameter	Type	Required	Default	Description
`urls`	string[]	Yes	--	List of web page URLs to extract text from
`includeMetadata`	boolean	No	`true`	Include links and extra metadata in the output

Example input

{
  "urls": [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://blog.apify.com"
  ],
  "includeMetadata": true
}
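If the URL list comes from user input or a spreadsheet, it can help to clean it before starting a run, since each extracted URL is billed. The following is a minimal sketch, not part of the actor: `build_input` is a hypothetical helper that keeps only absolute http(s) URLs, drops duplicates, and returns a dict in the input shape shown above.

```python
from urllib.parse import urlparse


def build_input(urls, include_metadata=True):
    """Return a run-input dict with only absolute http(s) URLs, de-duplicated in order."""
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        parsed = urlparse(url)
        # Skip relative or malformed entries that the actor could not fetch.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    if not cleaned:
        raise ValueError("urls is required and must contain at least one http(s) URL")
    return {"urls": cleaned, "includeMetadata": include_metadata}


run_input = build_input([
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://blog.apify.com",
    "https://blog.apify.com",  # duplicate, dropped
    "not-a-url",               # no scheme or host, dropped
])
print(run_input)
```

The resulting dict can be passed as `run_input` to the Python client call shown in the API section.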

Output example

{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping - Wikipedia",
  "metaDescription": "...",
  "author": null,
  "publishedDate": null,
  "language": "en",
  "mainText": "Web scraping is the process of...",
  "headings": [
    {"level": 1, "text": "Web scraping"},
    {"level": 2, "text": "Techniques"}
  ],
  "links": [
    {"text": "data extraction", "href": "/wiki/Data_extraction", "isExternal": false}
  ],
  "wordCount": 3450,
  "charCount": 21000,
  "error": null,
  "extractedAt": "2026-03-01T12:00:00.000Z"
}

Output fields

Field	Type	Description
`url`	string	The extracted page URL
`title`	string	The page title
`metaDescription`	string	The meta description tag content
`author`	string	Author name if detected from meta tags
`publishedDate`	string	Publish date if detected from meta tags
`language`	string	Page language from the lang attribute
`mainText`	string	Clean text content with HTML stripped
`headings`	array	List of headings with level (1-6) and text
`links`	array	List of links with text, href, and isExternal flag
`wordCount`	number	Total words in the extracted text
`charCount`	number	Total characters in the extracted text
`error`	string	Error message if extraction failed, null otherwise
`extractedAt`	string	ISO timestamp of the extraction
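As an illustration of working with these fields, the sketch below turns the `headings` array into an indented outline and resolves relative `href` values against the page `url`. The helper names `outline` and `absolute_links` are hypothetical, and the code assumes that `headings` or `links` may be missing or null (for example when `includeMetadata` is false), so it treats both as optional.

```python
from urllib.parse import urljoin


def outline(item):
    """Render the headings array as a plain-text outline, indented by level."""
    lines = []
    for heading in item.get("headings") or []:
        lines.append("  " * (heading["level"] - 1) + heading["text"])
    return "\n".join(lines)


def absolute_links(item, external=None):
    """Resolve each href against the page URL; optionally filter on isExternal."""
    resolved = []
    for link in item.get("links") or []:
        if external is not None and link["isExternal"] != external:
            continue
        resolved.append(urljoin(item["url"], link["href"]))
    return resolved


# Sample item trimmed from the output example above.
item = {
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "headings": [
        {"level": 1, "text": "Web scraping"},
        {"level": 2, "text": "Techniques"},
    ],
    "links": [
        {"text": "data extraction", "href": "/wiki/Data_extraction", "isExternal": False},
    ],
}

print(outline(item))
print(absolute_links(item, external=False))
```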

How to extract text from web pages

Open Webpage Text Extractor on Apify.
Enter one or more web page URLs in the urls field.
Choose whether to include metadata (links, headings, author info) by setting includeMetadata.
Click Start and wait for the run to finish.
Download the extracted text as JSON, CSV, or Excel from the Dataset tab.

How much does it cost to extract text from web pages?

Webpage Text Extractor uses Apify's pay-per-event pricing model. You only pay for what you use.

Event	Price	Description
Start	$0.035	One-time per run
URL extracted	$0.001	Per page extracted

Example costs:

10 pages: $0.035 + 10 x $0.001 = $0.045
100 pages: $0.035 + 100 x $0.001 = $0.135
1,000 pages: $0.035 + 1,000 x $0.001 = $1.035
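The same arithmetic can be written as a small helper for budgeting. This is a sketch using the two prices from the table above; `estimate_cost` and its `runs` parameter are illustrative names, and `Decimal` is used so the totals are not affected by floating-point rounding.

```python
from decimal import Decimal

START_FEE = Decimal("0.035")    # charged once per run
PER_URL_FEE = Decimal("0.001")  # charged per page extracted


def estimate_cost(pages, runs=1):
    """Estimated cost in USD for extracting `pages` URLs across `runs` runs."""
    return START_FEE * runs + PER_URL_FEE * pages


for pages in (10, 100, 1000):
    print(pages, estimate_cost(pages))
```

Because the start fee is charged per run, batching URLs into fewer runs is cheaper: under these prices, 1,000 pages in one run cost $1.035, while the same pages split across 10 runs cost $1.35.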

Using the Apify API

You can start Webpage Text Extractor programmatically from your own applications using the Apify API. The Node.js and Python examples below run the actor and retrieve the results; the cURL example starts a run.

Node.js

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/webpage-text-extractor').call({
    urls: ['https://en.wikipedia.org/wiki/Web_scraping'],
    includeMetadata: true,
});

// Read the extracted items from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient('YOUR_TOKEN')

# Start the actor and wait for the run to finish.
run = client.actor('automation-lab/webpage-text-extractor').call(run_input={
    'urls': ['https://en.wikipedia.org/wiki/Web_scraping'],
    'includeMetadata': True,
})

# Read the extracted items from the run's default dataset.
items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~webpage-text-extractor/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://en.wikipedia.org/wiki/Web_scraping"],
    "includeMetadata": true
  }'
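The cURL request only starts a run; the extracted items are stored in the run's default dataset, as the Node.js and Python examples show. As a rough sketch, the URL for downloading those items can be built as follows. It assumes the Apify API's dataset items endpoint with its `format` and `token` query parameters; `dataset_items_url` is a hypothetical helper and `DATASET_ID` is a placeholder for the `defaultDatasetId` of your run.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id, token, fmt="json"):
    """Build the download URL for a dataset's items in the given format."""
    query = urlencode({"format": fmt, "token": token})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# DATASET_ID and YOUR_TOKEN are placeholders, not real values.
print(dataset_items_url("DATASET_ID", "YOUR_TOKEN", fmt="csv"))
```

Wait until the run has finished before requesting the URL, otherwise the dataset may still be incomplete.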

Use with Claude AI (MCP)

This actor is available as a tool in Claude AI through the Model Context Protocol (MCP). Add it to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client.

Setup for Claude Code

$ claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/webpage-text-extractor"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com?tools=automation-lab/webpage-text-extractor"
    }
  }
}

Example prompts

"Extract the main text content from this article: https://example.com/blog/post"
"Get clean text from these web pages and summarize them"
"How many words are on this page and what is the heading structure?"

Learn more in the Apify MCP documentation.

Integrations

Webpage Text Extractor works with all major automation platforms available on Apify. Export results to Google Sheets to build a text content database for analysis. Use Zapier or Make to trigger text extraction whenever new URLs are added to a watchlist. Send extracted text to Slack channels for quick review. Pipe results into n8n workflows to feed clean text into LLM APIs, vector databases, or NLP pipelines. Set up webhooks to get notified when extraction finishes and automatically pass text to downstream processing.

Tips and best practices

Set includeMetadata to false if you only need the main text -- this reduces output size significantly, especially for pages with hundreds of links
Use the headings array to understand document structure before feeding text into LLMs -- heading hierarchy provides valuable context for summarization and Q&A
Filter by language when processing multilingual sites to route text to the correct NLP model or translation pipeline
Combine with Content Readability Checker to get both the raw text and readability scores for each page
Chain with Sitemap URL Extractor to first get all URLs from a sitemap, then extract clean text from every page for a complete content export
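The language-routing tip can be sketched as a small grouping step over the dataset items. This is an illustrative helper, not part of the actor: `group_by_language` is a hypothetical name, and the code assumes the `language` value may carry a region subtag such as "en-US" (as HTML lang attributes often do), which it reduces to the primary language code.

```python
from collections import defaultdict


def group_by_language(items, default="unknown"):
    """Group successfully extracted items by primary language code."""
    groups = defaultdict(list)
    for item in items:
        if item.get("error"):
            continue  # skip pages where extraction failed
        lang = (item.get("language") or default).lower().split("-")[0]
        groups[lang].append(item)
    return dict(groups)


# Made-up sample items using the documented field names.
items = [
    {"url": "https://example.com/a", "language": "en-US", "error": None},
    {"url": "https://example.com/b", "language": "de", "error": None},
    {"url": "https://example.com/c", "language": None, "error": None},
    {"url": "https://example.com/d", "language": "en", "error": "Request failed"},
]

for lang, group in group_by_language(items).items():
    print(lang, [item["url"] for item in group])
```

Each group can then be sent to the NLP model or translation pipeline for that language.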

Legality

This tool analyzes publicly accessible web content. Automated analysis of public web resources is standard practice in SEO and web development. Always respect robots.txt directives and rate limits when analyzing third-party websites. For personal data processing, ensure compliance with applicable privacy regulations.

FAQ

Does the actor render JavaScript? No. The actor uses plain HTTP requests and extracts text from the initial HTML response. Pages that load content dynamically via JavaScript after page load may return incomplete text.

What is the mainText field? It contains the clean text content extracted from the page's main content area, with all HTML tags, scripts, styles, and navigation elements stripped out. This is the primary output field for most use cases.

Can I extract text from PDF or Word documents? No. The actor only processes HTML web pages. For document conversion, use a dedicated file processing tool or actor.

The extracted text includes navigation menu and footer text. How do I get only the article content? The actor tries to detect the main content area using semantic HTML elements (<article>, <main>). If the website does not use these elements, the actor falls back to <body> and strips common non-content elements. Check the contentArea field in the output -- if it says "body", the site likely lacks proper semantic markup, which can cause nav/footer text to be included.

The actor returns very little or no text for a page that has content. Why? The page likely loads its content via client-side JavaScript (React, Angular, Vue, etc.). The actor uses plain HTTP requests and parses the initial HTML response without executing JavaScript. For JavaScript-heavy sites, you may need a browser-based scraping solution.
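The two troubleshooting answers above suggest a simple post-run check. The sketch below labels each item so that suspicious pages can be reviewed or re-run with a browser-based tool. `triage` is a hypothetical helper, the 50-word threshold is an arbitrary assumption to tune for your pages, and `contentArea` is read defensively because it appears in this FAQ but not in the output example.

```python
def triage(item, min_words=50):
    """Classify an output item as failed, suspiciously short, body fallback, or ok."""
    if item.get("error"):
        return "failed"
    if (item.get("wordCount") or 0) < min_words:
        # Very little text often means the content is rendered client-side.
        return "possibly-js-rendered"
    if item.get("contentArea") == "body":
        # No <article> or <main> found; nav and footer text may be included.
        return "body-fallback"
    return "ok"


# Made-up sample items using field names from this document.
samples = [
    {"url": "https://example.com/spa", "wordCount": 4, "error": None},
    {"url": "https://example.com/old", "wordCount": 900, "contentArea": "body", "error": None},
    {"url": "https://example.com/post", "wordCount": 1200, "contentArea": "article", "error": None},
    {"url": "https://example.com/down", "wordCount": 0, "error": "Request failed"},
]

for sample in samples:
    print(sample["url"], triage(sample))
```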

URL: https://apify.com/automation-lab/webpage-text-extractor