Webpage Text Extractor

Pricing

from $0.50 / 1,000 extracted webpages

Extract clean text, article text, and Markdown from public web pages. Get titles, metadata, headings, links, word counts, final URLs, and timestamps for LLM prompts, RAG inputs, reviews, and exports.

Rating

0.0

(0)

Developer

Maxime Dupré

Maintained by Community

Last modified: 6 days ago

📄 Webpage text extractor for LLM-ready content

Webpage Text Extractor turns public web pages into clean text, article text, or Markdown for LLM prompts, RAG inputs, content review, and spreadsheet exports. Add one URL or a batch of URLs, choose the text shape, and the Actor returns readable page content with source metadata, headings, links, counts, redirects, and scrape timestamps.

Use it when you need the text from pages such as Example Domain, documentation pages, blog posts, help-center articles, landing pages, or public knowledge-base pages without copying each page by hand. It is built for public HTML pages that can be opened without logging in.

For a quick first run, keep the prefilled public webpage list, leave Extraction mode on Clean page text, and run the Actor. You will get a representative batch of output items that shows the full row shape before you add your own URLs.

🧭 What this Actor does

Extracts clean text from public HTML web pages.
Supports Clean page text, Article text, and Markdown for LLMs modes.
Removes common page noise such as scripts, styles, navigation, headers, footers, forms, and hidden elements before extracting text.
Includes useful page details by default: title, meta description, author, published date, language, headings, links, canonical URL, final URL, HTTP status, word count, and character count.
Saves one output item per successfully extracted webpage.
Marks sparse but usable pages as partial so you can review them.
Logs skipped URLs when a page is invalid, unavailable, non-HTML, empty, private, blocked, or too slow to load.

The Actor is focused on webpage text extraction. It does not extract PDFs, Word documents, OCR from images, video transcripts, private dashboards, logged-in pages, or full rendered content from every JavaScript-heavy web app.
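The noise removal described above can be sketched with Python's standard library. This is only an illustration of the general technique, not the Actor's implementation: the class name, the function name, and the exact tag set are assumptions, and hidden elements are not handled here.

```python
from html.parser import HTMLParser

# Noise tags for this sketch, taken from the list above.
# Hidden elements are not handled in this simplified version.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "form"}


class CleanTextParser(HTMLParser):
    """Collect text that is not nested inside a noise element."""

    def __init__(self):
        super().__init__()
        self.open_noise = []  # stack of currently open noise tags
        self.chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.open_noise.append(tag)

    def handle_endtag(self, tag):
        # Only close the innermost open noise tag.
        if self.open_noise and self.open_noise[-1] == tag:
            self.open_noise.pop()

    def handle_data(self, data):
        # Keep text only when no noise element is open.
        if not self.open_noise and data.strip():
            self.chunks.append(data)


def clean_text(html):
    """Return the kept text with whitespace collapsed to single spaces."""
    parser = CleanTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Calling clean_text on a page with a navigation bar, a script, and a footer returns only the heading and paragraph text.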

📊 Data you can extract

Each dataset item is one successfully extracted webpage. Rows can include:

type - always webpage_text
status - ok or partial
inputIndex - submitted URL position
requestedUrl - original URL from the input
finalUrl - final page URL after redirects
canonicalUrl - canonical page URL when the page provides one
httpStatusCode and contentType - response details for the extracted page
extractionMode - cleanText, articleText, or markdown
title, metaDescription, author, publishedAt, and language
excerpt - short preview of the extracted text
text - main extracted text in the selected mode
markdown - Markdown text when Markdown mode is selected
wordCount and charCount
headings - page heading outline with level and text
links - visible page links with text, absolute URL, and external-link flag
quality - sparse-content and redirect flags
scrapedAt - UTC timestamp when the page was saved

You can export the dataset as JSON, CSV, Excel, XML, RSS, or HTML, or use the same output through the Apify API, schedules, webhooks, and integrations.
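If you work with the exported JSON yourself, a small helper can pull out the partial rows for review. The sketch below assumes the rows are already loaded as a list of dictionaries with the field names listed above; the function names and the column selection are illustrative.

```python
import csv
import io

# Columns chosen for a quick review sheet; adjust to your needs.
REVIEW_COLUMNS = ["inputIndex", "requestedUrl", "finalUrl", "status", "wordCount"]


def partial_rows(rows):
    """Return only the rows the Actor marked as partial."""
    return [row for row in rows if row.get("status") == "partial"]


def to_review_csv(rows, columns=REVIEW_COLUMNS):
    """Serialize the selected columns of each row to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",  # drop fields that are not in `columns`
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Missing fields are written as empty cells, so rows with different shapes still line up in a spreadsheet.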

🚀 How to run it

Open the Input tab.
Add one or more public webpage URLs in Webpage URLs.
Choose Extraction mode.
Keep Maximum pages small for your first run, then raise it when the output looks right.
Run the Actor and open the dataset.

Use Clean page text for a general page-to-text scraper. Use Article text for blog posts, articles, and reader-style pages where the main content matters most. Use Markdown for LLMs when you want headings and links represented in Markdown for prompts, RAG ingestion, or documentation workflows.

🧾 Input example

{
  "startUrls": [
    { "url": "https://example.com" },
    { "url": "https://www.iana.org/domains/reserved" }
  ],
  "extractionMode": "markdown",
  "maxPages": 2
}

Webpage URLs is the only required input. Add public http or https pages that can be opened without a login.

Extraction mode controls the main text format saved in text. The supported values are cleanText, articleText, and markdown.

Maximum pages caps how many submitted URLs can be extracted in one run. The public maximum is 100.
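The input rules above can be checked before a run starts. The following is a minimal sketch: the function name, the cleanText default, and the lower bound of 1 for maxPages are assumptions, while the field names, supported modes, and the limit of 100 come from this page.

```python
from urllib.parse import urlparse

SUPPORTED_MODES = {"cleanText", "articleText", "markdown"}
MAX_PAGES_LIMIT = 100  # public maximum stated above


def build_input(urls, extraction_mode="cleanText", max_pages=None):
    """Build an input object shaped like the example above, or raise ValueError."""
    if extraction_mode not in SUPPORTED_MODES:
        raise ValueError(f"Unsupported extraction mode: {extraction_mode!r}")

    start_urls = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an http(s) URL: {url!r}")
        start_urls.append({"url": url})
    if not start_urls:
        raise ValueError("At least one webpage URL is required")

    # Assumption: default the cap to the number of URLs, up to the limit.
    if max_pages is None:
        max_pages = min(len(start_urls), MAX_PAGES_LIMIT)
    if not 1 <= max_pages <= MAX_PAGES_LIMIT:
        raise ValueError(f"maxPages must be between 1 and {MAX_PAGES_LIMIT}")

    return {
        "startUrls": start_urls,
        "extractionMode": extraction_mode,
        "maxPages": max_pages,
    }
```

The returned dictionary can be pasted into the Input tab as JSON or passed as the run input through the Apify API.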

📦 Output example

{
  "type": "webpage_text",
  "status": "ok",
  "inputIndex": 1,
  "requestedUrl": "https://example.com",
  "finalUrl": "https://example.com/",
  "canonicalUrl": null,
  "httpStatusCode": 200,
  "contentType": "text/html",
  "extractionMode": "markdown",
  "title": "Example Domain",
  "metaDescription": null,
  "author": null,
  "publishedAt": null,
  "language": "en",
  "excerpt": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "text": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "wordCount": 20,
  "charCount": 127,
  "headings": [
    { "level": 1, "text": "Example Domain" }
  ],
  "links": [
    {
      "text": "More information...",
      "url": "https://www.iana.org/domains/example",
      "isExternal": true
    }
  ],
  "quality": {
    "isSparse": false,
    "wasRedirected": true,
    "reason": null
  },
  "scrapedAt": "2026-06-13T14:12:00.000Z"
}
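For LLM prompts or RAG ingestion, a row like the one above is often turned into a text block that keeps its source attached. The helper below is one possible shape, not part of the Actor: the function name, the header layout, and the max_chars cutoff are assumptions.

```python
def to_context_block(row, max_chars=4000):
    """Format one output row as a prompt-ready block with source details."""
    # Prefer Markdown when present, otherwise fall back to the text field.
    body = row.get("markdown") or row.get("text") or ""
    source = row.get("finalUrl") or row.get("requestedUrl") or "unknown"
    header = [
        f"Title: {row.get('title') or 'Untitled'}",
        f"Source: {source}",
        f"Scraped at: {row.get('scrapedAt') or 'unknown'}",
    ]
    # Truncate long pages so one row cannot fill the whole prompt.
    return "\n".join(header) + "\n\n" + body[:max_chars]
```

Keeping finalUrl and scrapedAt in the block makes it easier to cite or refresh a page later.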

💸 Pricing

This Actor uses pay-per-event pricing. You are charged for each successfully extracted webpage. Failed, invalid, unavailable, empty, or non-HTML URLs are skipped and are not saved as output items.

Current event prices are:

Tier	Price per extracted webpage
FREE	$0.00090
BRONZE	$0.00090
SILVER	$0.00070
GOLD	$0.00050
PLATINUM	$0.00035
DIAMOND	$0.00025

This Actor has no separate Actor-start charge.
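To estimate a run's cost, multiply the number of successfully extracted pages by your tier's price. The sketch below copies the prices from the table above; the function name is illustrative, and platform usage costs, if any apply to your plan, are not included.

```python
from decimal import Decimal

# Price per extracted webpage in USD, copied from the table above.
PRICE_PER_PAGE = {
    "FREE": Decimal("0.00090"),
    "BRONZE": Decimal("0.00090"),
    "SILVER": Decimal("0.00070"),
    "GOLD": Decimal("0.00050"),
    "PLATINUM": Decimal("0.00035"),
    "DIAMOND": Decimal("0.00025"),
}


def estimate_cost(extracted_pages, tier):
    """Return the event cost in USD for a number of extracted pages."""
    if extracted_pages < 0:
        raise ValueError("extracted_pages must not be negative")
    return PRICE_PER_PAGE[tier.upper()] * extracted_pages
```

For example, 1,000 extracted pages on the GOLD tier come to $0.50, and 100 pages on the FREE tier come to $0.09. Skipped URLs are not charged, so count only the rows you expect in the dataset.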

⚠️ Limits and caveats

Webpage Text Extractor works best on public HTML pages with readable content in the initial page response. Pages that require login, block access, return non-HTML files, or rely heavily on client-side rendering may produce no output item or a partial row.

The Actor does not crawl a whole website from one URL. It extracts the submitted URLs only. If you need a link map first, use a crawler to collect URLs, then pass selected pages to this Actor.
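When a crawler hands you more URLs than one run accepts, you can split them into batches first. This sketch assumes the 100-page maximum mentioned in the input section; the function name and the de-duplication step are illustrative choices.

```python
def batch_urls(urls, batch_size=100):
    """Split URLs into ordered batches, dropping exact duplicates."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # dict.fromkeys keeps the first occurrence of each URL, in order.
    unique = list(dict.fromkeys(urls))
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
```

Each batch can then be submitted as the Webpage URLs of a separate run.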

❓ FAQ

🧾 Can I use this as a webpage to Markdown converter?

Yes. Choose Markdown for LLMs. The main text field will contain Markdown, and the markdown field will contain the same Markdown value for easy filtering.

🔗 Does it include links and headings?

Yes. Headings and visible links are included by default when the page provides them. You do not need to turn on separate metadata options.

🔒 Does it scrape private pages?

No. This Actor is for public web pages. It does not accept cookies, sessions, API keys, or login credentials.

⚠️ What happens when a URL fails?

The Actor logs the skipped URL and continues with the rest of the input. Only successfully extracted pages are saved to the dataset.
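Because skipped URLs never reach the dataset, you can list them by comparing your input against the saved rows. The sketch assumes requestedUrl echoes each submitted URL unchanged, as the field description suggests; the function name is illustrative.

```python
def find_skipped(submitted_urls, rows):
    """Return submitted URLs that have no matching row in the dataset."""
    extracted = {row["requestedUrl"] for row in rows if "requestedUrl" in row}
    return [url for url in submitted_urls if url not in extracted]
```

The run log remains the place to look for the reason each URL was skipped.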

📝 Changelog

0.1: Initial release.

🆘 Support

For issues, questions, or feature requests, file a ticket and I'll fix or implement it in less than 24h 🫡

🔗 Other actors

Website URL Crawler - Crawl public websites and export a clean link map before extracting selected page text.
Web Images Scraper - Extract image URLs, metadata, and optional image files from public webpages.
Website Emails Scraper - Find public contact emails from websites and keep the source URL attached.
Font Detector - Detect web fonts, CSS font families, and font source evidence from public pages.
Ahrefs Free Website Stats Scraper - Extract public Ahrefs website metrics for SEO research and website audits.

Made with ❤️ by Maxime Dupré


URL: https://apify.com/maximedupre/webpage-text-extractor
