URL: https://apify.com/automation-lab/webpage-text-extractor

Webpage Text Extractor - Extract Clean Text from URLs · Apify


Pricing: Pay per event

Webpage Text Extractor

Rating: 0.0 (0)

Developer: Stas Persiianenko (Maintained by Community)

Actor stats: 0 bookmarked, 66 total users, 14 monthly active users, last modified 3 months ago

Extract clean text content from web pages. Strips HTML and returns structured text with headings, links, metadata, and word count.

What does Webpage Text Extractor do?

This actor fetches web pages and extracts their clean text content by stripping all HTML tags, scripts, and styles. It identifies the main content area (article, main, etc.), extracts headings structure, page links, and metadata like author, publish date, and language. Use it for LLM input preparation, content analysis, text mining, or feeding clean text into downstream data pipelines.

Use cases

  • AI/LLM engineers -- convert web pages to clean text for RAG pipelines, fine-tuning datasets, or prompt context
  • Content analysts -- extract text for sentiment analysis, topic modeling, keyword extraction, or NLP processing
  • Data journalists -- collect article text from multiple news sources for comparison and analysis
  • Accessibility auditors -- extract text structure and heading hierarchy to verify correct semantic markup
  • Data pipeline builders -- feed clean, structured text into downstream processing tools and databases

Why use Webpage Text Extractor?

  • AI-ready clean text -- strips all HTML, scripts, styles, and ads to return structured output ready for LLM training, RAG pipelines, and AI agent workflows
  • Rich metadata -- extracts title, meta description, author, publish date, language, and Open Graph tags
  • Heading structure -- returns all headings with their level (H1-H6) for document outline analysis
  • Link extraction -- captures all links with text, href, and internal/external classification
  • Configurable metadata -- toggle metadata inclusion with the includeMetadata option to control output size
  • Pay-per-event pricing -- costs $0.001 per URL plus a $0.035 start fee per run, with no monthly subscription

Input parameters

Parameter        Type      Required  Default  Description
urls             string[]  Yes       --       List of web page URLs to extract text from
includeMetadata  boolean   No        true     Include links and extra metadata in the output

Example input

{
    "urls": [
        "https://en.wikipedia.org/wiki/Web_scraping",
        "https://blog.apify.com"
    ],
    "includeMetadata": true
}
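
The input above can also be assembled in code before a run is started. The following is a minimal sketch, not part of the actor: `build_input` is a hypothetical helper, and its URL check is a local sanity test that assumes only absolute http(s) URLs are useful here. The actor's own validation rules are not documented on this page.

```python
from urllib.parse import urlparse


def build_input(urls, include_metadata=True):
    """Return a run-input dict with the two documented fields."""
    cleaned = []
    for url in urls:
        url = url.strip()
        parts = urlparse(url)
        # Local sanity check only (an assumption, not the actor's validation).
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        if url not in cleaned:  # drop exact duplicates, keep input order
            cleaned.append(url)
    if not cleaned:
        raise ValueError("urls is required and must not be empty")
    return {"urls": cleaned, "includeMetadata": bool(include_metadata)}


run_input = build_input(
    ["https://en.wikipedia.org/wiki/Web_scraping", "https://blog.apify.com"]
)
print(run_input)
```

The resulting dictionary has the same shape as the example input and can be passed as the run input in the API examples further down.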

Output example

{
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "title": "Web scraping - Wikipedia",
    "metaDescription": "...",
    "author": null,
    "publishedDate": null,
    "language": "en",
    "mainText": "Web scraping is the process of...",
    "headings": [
        {"level": 1, "text": "Web scraping"},
        {"level": 2, "text": "Techniques"}
    ],
    "links": [
        {"text": "data extraction", "href": "/wiki/Data_extraction", "isExternal": false}
    ],
    "wordCount": 3450,
    "charCount": 21000,
    "error": null,
    "extractedAt": "2026-03-01T12:00:00.000Z"
}

Output fields

Field            Type    Description
url              string  The extracted page URL
title            string  The page title
metaDescription  string  The meta description tag content
author           string  Author name if detected from meta tags
publishedDate    string  Publish date if detected from meta tags
language         string  Page language from the lang attribute
mainText         string  Clean text content with HTML stripped
headings         array   List of headings with level (1-6) and text
links            array   List of links with text, href, and isExternal flag
wordCount        number  Total words in the extracted text
charCount        number  Total characters in the extracted text
error            string  Error message if extraction failed, null otherwise
extractedAt      string  ISO timestamp of the extraction
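
To illustrate how the headings and links fields might be consumed, here is a small sketch that works on one dataset item shaped like the output example. The helper names `heading_outline` and `split_links` are hypothetical. The code assumes links may be missing when includeMetadata is false, so it reads that field defensively.

```python
def heading_outline(headings):
    """Render headings as an outline, indented two spaces per level below H1."""
    return "\n".join("  " * (h["level"] - 1) + h["text"] for h in headings)


def split_links(links):
    """Partition links into (internal, external) using the isExternal flag."""
    internal = [link for link in links if not link["isExternal"]]
    external = [link for link in links if link["isExternal"]]
    return internal, external


# Sample item trimmed from the output example above.
item = {
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "headings": [
        {"level": 1, "text": "Web scraping"},
        {"level": 2, "text": "Techniques"},
    ],
    "links": [
        {"text": "data extraction", "href": "/wiki/Data_extraction", "isExternal": False}
    ],
}

outline = heading_outline(item.get("headings") or [])
internal, external = split_links(item.get("links") or [])
print(outline)
print(len(internal), "internal,", len(external), "external")
```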

How to extract text from web pages

  1. Open Webpage Text Extractor on Apify.
  2. Enter one or more web page URLs in the urls field.
  3. Choose whether to include metadata (links, headings, author info) by setting includeMetadata.
  4. Click Start and wait for the run to finish.
  5. Download the extracted text as JSON, CSV, or Excel from the Dataset tab.

How much does it cost to extract text from web pages?

Webpage Text Extractor uses Apify's pay-per-event pricing model. You only pay for what you use.

Event          Price   Description
Start          $0.035  One-time per run
URL extracted  $0.001  Per page extracted

Example costs:

  • 10 pages: $0.035 + 10 x $0.001 = $0.045
  • 100 pages: $0.035 + 100 x $0.001 = $0.135
  • 1,000 pages: $0.035 + 1,000 x $0.001 = $1.035
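
The arithmetic above can be sketched as a small estimator. The two prices come from the table. The function name `estimate_cost` and the `runs` parameter are illustrative assumptions, added to show that the start fee is charged once per run.

```python
from decimal import Decimal

START_FEE = Decimal("0.035")    # one-time per run, from the pricing table
PER_URL_FEE = Decimal("0.001")  # per page extracted, from the pricing table


def estimate_cost(pages, runs=1):
    """Estimate the price in USD for `pages` URLs spread over `runs` runs."""
    if pages < 0 or runs < 1:
        raise ValueError("pages must be >= 0 and runs must be >= 1")
    return START_FEE * runs + PER_URL_FEE * pages


for pages in (10, 100, 1000):
    print(pages, "pages:", estimate_cost(pages))
```

Under these prices, splitting 1,000 pages across 10 runs would come to $1.35 rather than $1.035, so batching URLs into fewer runs keeps the start fee low.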

Using the Apify API

You can start Webpage Text Extractor programmatically from your own applications using the Apify API. The following examples show how to run the actor and retrieve results in Node.js and Python, and how to start a run with cURL.

Node.js

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/webpage-text-extractor').call({
    urls: ['https://en.wikipedia.org/wiki/Web_scraping'],
    includeMetadata: true,
});

// Fetch the extracted items from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient('YOUR_TOKEN')

# Start the actor and wait for the run to finish.
run = client.actor('automation-lab/webpage-text-extractor').call(run_input={
    'urls': ['https://en.wikipedia.org/wiki/Web_scraping'],
    'includeMetadata': True,
})

# Fetch the extracted items from the run's default dataset.
items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~webpage-text-extractor/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://en.wikipedia.org/wiki/Web_scraping"],
    "includeMetadata": true
  }'

Use with Claude AI (MCP)

This actor is available as a tool in Claude AI through the Model Context Protocol (MCP). Add it to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client.

Setup for Claude Code

$ claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/webpage-text-extractor"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com?tools=automation-lab/webpage-text-extractor"
        }
    }
}

Example prompts

  • "Extract the main text content from this article: https://example.com/blog/post"
  • "Get clean text from these web pages and summarize them"
  • "How many words are on this page and what is the heading structure?"

Learn more in the Apify MCP documentation.

Integrations

Webpage Text Extractor works with all major automation platforms available on Apify. Export results to Google Sheets to build a text content database for analysis. Use Zapier or Make to trigger text extraction whenever new URLs are added to a watchlist. Send extracted text to Slack channels for quick review. Pipe results into n8n workflows to feed clean text into LLM APIs, vector databases, or NLP pipelines. Set up webhooks to get notified when extraction finishes and automatically pass text to downstream processing.

Tips and best practices

  • Set includeMetadata to false if you only need the main text -- this reduces output size significantly, especially for pages with hundreds of links
  • Use the headings array to understand document structure before feeding text into LLMs -- heading hierarchy provides valuable context for summarization and Q&A
  • Filter by language when processing multilingual sites to route text to the correct NLP model or translation pipeline
  • Combine with Content Readability Checker to get both the raw text and readability scores for each page
  • Chain with Sitemap URL Extractor to first get all URLs from a sitemap, then extract clean text from every page for a complete content export
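
As a rough illustration of the language-routing tip, the sketch below groups dataset items by their language field and skips items whose error field is set. The helper name `group_by_language`, the "unknown" bucket, and the reduction of tags such as "en-US" to "en" are assumptions made for this example, not actor behavior. The sample items and the error string are invented.

```python
from collections import defaultdict


def group_by_language(items, default="unknown"):
    """Group successful items by primary language subtag (e.g. 'en-US' -> 'en')."""
    groups = defaultdict(list)
    for item in items:
        if item.get("error"):  # skip pages where extraction failed
            continue
        lang = (item.get("language") or default).lower().split("-")[0]
        groups[lang].append(item)
    return dict(groups)


# Hypothetical items carrying only the fields this sketch reads.
items = [
    {"url": "https://example.com/a", "language": "en-US", "error": None},
    {"url": "https://example.com/b", "language": "de", "error": None},
    {"url": "https://example.com/c", "language": None, "error": None},
    {"url": "https://example.com/d", "language": "en", "error": "Request failed"},
]

groups = group_by_language(items)
for lang, group in sorted(groups.items()):
    print(lang, [it["url"] for it in group])
```

Each group can then be handed to whichever NLP model or translation step fits that language.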

Legality

This tool analyzes publicly accessible web content. Automated analysis of public web resources is standard practice in SEO and web development. Always respect robots.txt directives and rate limits when analyzing third-party websites. For personal data processing, ensure compliance with applicable privacy regulations.

FAQ

Does the actor render JavaScript?
No. The actor uses plain HTTP requests and extracts text from the initial HTML response. Pages that load content dynamically via JavaScript after page load may return incomplete text.

What is the mainText field?
It contains the clean text content extracted from the page's main content area, with all HTML tags, scripts, styles, and navigation elements stripped out. This is the primary output field for most use cases.

Can I extract text from PDF or Word documents?
No. The actor only processes HTML web pages. For document conversion, use a dedicated file processing tool or actor.

The extracted text includes navigation menu and footer text. How do I get only the article content?
The actor tries to detect the main content area using semantic HTML elements (<article>, <main>). If the website does not use these elements, the actor falls back to <body> and strips common non-content elements. Check the contentArea field in the output -- if it says "body", the site likely lacks proper semantic markup, which can cause nav/footer text to be included.

The actor returns very little or no text for a page that has content. Why?
The page likely loads its content via client-side JavaScript (React, Angular, Vue, etc.). The actor uses plain HTTP requests and parses the initial HTML response without executing JavaScript. For JavaScript-heavy sites, you may need a browser-based scraping solution.
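
The two troubleshooting answers above suggest a simple post-run check, sketched here under stated assumptions. The function name `diagnose`, the warning labels, and the 50-word threshold are all hypothetical. The contentArea field is mentioned in the FAQ but not listed in the output fields table, so it is read with a fallback in case it is absent.

```python
def diagnose(item, min_words=50):
    """Return a list of warning labels for one dataset item."""
    if item.get("error"):
        return ["failed"]
    warnings = []
    # Very little text may indicate content rendered by client-side JavaScript.
    if (item.get("wordCount") or 0) < min_words:
        warnings.append("thin-text")
    # A "body" content area suggests nav/footer text may be mixed in.
    if item.get("contentArea") == "body":
        warnings.append("body-fallback")
    return warnings


# Hypothetical items carrying only the fields this sketch reads.
samples = [
    {"url": "https://example.com/spa", "wordCount": 4, "error": None},
    {"url": "https://example.com/old", "wordCount": 900, "contentArea": "body", "error": None},
    {"url": "https://example.com/ok", "wordCount": 3450, "error": None},
]

for sample in samples:
    print(sample["url"], diagnose(sample))
```

Pages flagged as thin are candidates for a browser-based scraper, while body-fallback pages may need extra cleanup before the text is used.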
