URL: https://apify.com/maximedupre/webpage-text-extractor

Webpage Text Extractor for LLMs and RAG Data · Apify


Pricing

from $0.50 / 1,000 extracted webpages

Webpage Text Extractor

Extract clean text, article text, and Markdown from public web pages. Get titles, metadata, headings, links, word counts, final URLs, and timestamps for LLM prompts, RAG inputs, reviews, and exports.

Rating

0.0

(0)

Developer

Maxime Dupré

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 6 days ago

📄 Webpage text extractor for LLM-ready content

Webpage Text Extractor turns public web pages into clean text, article text, or Markdown for LLM prompts, RAG inputs, content review, and spreadsheet exports. Add one URL or a batch of URLs, choose the text shape, and the Actor returns readable page content with source metadata, headings, links, counts, redirects, and scrape timestamps.

Use it when you need the text from pages such as Example Domain, documentation pages, blog posts, help-center articles, landing pages, or public knowledge-base pages without copying each page by hand. It is built for public HTML pages that can be opened without logging in.

For a quick first run, keep the prefilled public webpage list, leave Extraction mode on Clean page text, and run the Actor. You will get a representative batch of output items that shows the full row shape before you add your own URLs.

🧭 What this Actor does

  • Extracts clean text from public HTML web pages.
  • Supports Clean page text, Article text, and Markdown for LLMs modes.
  • Removes common page noise such as scripts, styles, navigation, headers, footers, forms, and hidden elements before extracting text.
  • Includes useful page details by default: title, meta description, author, published date, language, headings, links, canonical URL, final URL, HTTP status, word count, and character count.
  • Saves one output item per successfully extracted webpage.
  • Marks sparse but usable pages as partial so you can review them.
  • Logs skipped URLs when a page is invalid, unavailable, non-HTML, empty, private, blocked, or too slow to load.

The Actor is focused on webpage text extraction. It does not extract PDFs, Word documents, OCR from images, video transcripts, private dashboards, logged-in pages, or full rendered content from every JavaScript-heavy web app.
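
The noise-removal step listed above can be illustrated with a small standard-library sketch. This is not the Actor's implementation, which is not published here. The tag list, the `CleanTextParser` class, and the `clean_text` helper are hypothetical names chosen for illustration, and hidden-element detection is left out.

```python
from html.parser import HTMLParser

# Tags treated as page noise in this sketch (an assumption, not the Actor's real list).
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "form"}


class CleanTextParser(HTMLParser):
    """Collects visible text while skipping anything nested inside a noise tag."""

    def __init__(self) -> None:
        super().__init__()
        self.noise_depth = 0  # > 0 while the parser is inside a noise element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.parts.append(text)


def clean_text(html: str) -> str:
    """Return the text outside noise tags, one text node per line."""
    parser = CleanTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Calling `clean_text` on a page with a navigation bar, a heading, a paragraph, and a footer would keep only the heading and paragraph text. A production extractor would also need to handle malformed markup and elements hidden through CSS.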

📊 Data you can extract

Each dataset item is one successfully extracted webpage. Rows can include:

  • type - always webpage_text
  • status - ok or partial
  • inputIndex - submitted URL position
  • requestedUrl - original URL from the input
  • finalUrl - final page URL after redirects
  • canonicalUrl - canonical page URL when the page provides one
  • httpStatusCode and contentType - response details for the extracted page
  • extractionMode - cleanText, articleText, or markdown
  • title, metaDescription, author, publishedAt, and language
  • excerpt - short preview of the extracted text
  • text - main extracted text in the selected mode
  • markdown - Markdown text when Markdown mode is selected
  • wordCount and charCount
  • headings - page heading outline with level and text
  • links - visible page links with text, absolute URL, and external-link flag
  • quality - sparse-content and redirect flags
  • scrapedAt - UTC timestamp when the page was saved

You can export the dataset as JSON, CSV, Excel, XML, RSS, or HTML, or use the same output through the Apify API, schedules, webhooks, and integrations.
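
For RAG ingestion, rows with the fields above can be split into smaller chunks that keep their source metadata. The sketch below is one possible approach. The `chunk_row` helper, the `max_words` default, and the chunk keys `source` and `chunkIndex` are assumptions for illustration; only `text`, `title`, `finalUrl`, and `requestedUrl` come from the field list.

```python
def chunk_row(row: dict, max_words: int = 200) -> list[dict]:
    """Split one dataset row's text into word-bounded chunks with source metadata."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = (row.get("text") or "").split()
    chunks = []
    for start in range(0, len(words), max_words):
        chunks.append({
            # Prefer the post-redirect URL, fall back to the submitted one.
            "source": row.get("finalUrl") or row.get("requestedUrl"),
            "title": row.get("title"),
            "chunkIndex": start // max_words,
            "text": " ".join(words[start:start + max_words]),
        })
    return chunks
```

Splitting on whitespace collapses line breaks, so this fits Clean page text output better than Markdown output, where heading-aware splitting would likely preserve more structure.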

🚀 How to run it

  1. Open the Input tab.
  2. Add one or more public webpage URLs in Webpage URLs.
  3. Choose Extraction mode.
  4. Keep Maximum pages small for your first run, then raise it when the output looks right.
  5. Run the Actor and open the dataset.

Use Clean page text for a general page-to-text scraper. Use Article text for blog posts, articles, and reader-style pages where the main content matters most. Use Markdown for LLMs when you want headings and links represented in Markdown for prompts, RAG ingestion, or documentation workflows.

🧾 Input example

{
  "startUrls": [
    {"url": "https://example.com"},
    {"url": "https://www.iana.org/domains/reserved"}
  ],
  "extractionMode": "markdown",
  "maxPages": 2
}

Webpage URLs is the only required input. Add public http or https pages that can be opened without a login.

Extraction mode controls the main text format saved in text. The supported values are cleanText, articleText, and markdown.

Maximum pages caps how many submitted URLs can be extracted in one run. The public maximum is 100.
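
The input rules above can be checked locally before starting a run. The sketch below builds the input object from a list of URLs. The `build_input` helper, its default of 10 pages, and the lower bound of 1 are assumptions for illustration; the field names, mode values, and the maximum of 100 come from the description above.

```python
from urllib.parse import urlparse

SUPPORTED_MODES = {"cleanText", "articleText", "markdown"}
MAX_PAGES_LIMIT = 100  # public maximum stated in the input description


def build_input(urls: list[str], mode: str = "cleanText", max_pages: int = 10) -> dict:
    """Validate the arguments and return an Actor input dictionary."""
    if not urls:
        raise ValueError("at least one webpage URL is required")
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported extractionMode: {mode!r}")
    if not 1 <= max_pages <= MAX_PAGES_LIMIT:
        raise ValueError(f"maxPages must be between 1 and {MAX_PAGES_LIMIT}")
    for url in urls:
        parsed = urlparse(url)
        # Only public http and https pages are accepted.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {url!r}")
    return {
        "startUrls": [{"url": url} for url in urls],
        "extractionMode": mode,
        "maxPages": max_pages,
    }
```

The returned dictionary has the same shape as the input example, so it could be passed as the run input through the Apify API or one of its client libraries.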

📦 Output example

{
  "type": "webpage_text",
  "status": "ok",
  "inputIndex": 1,
  "requestedUrl": "https://example.com",
  "finalUrl": "https://example.com/",
  "canonicalUrl": null,
  "httpStatusCode": 200,
  "contentType": "text/html",
  "extractionMode": "markdown",
  "title": "Example Domain",
  "metaDescription": null,
  "author": null,
  "publishedAt": null,
  "language": "en",
  "excerpt": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "text": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "wordCount": 20,
  "charCount": 127,
  "headings": [
    {"level": 1, "text": "Example Domain"}
  ],
  "links": [
    {
      "text": "More information...",
      "url": "https://www.iana.org/domains/example",
      "isExternal": true
    }
  ],
  "quality": {
    "isSparse": false,
    "wasRedirected": true,
    "reason": null
  },
  "scrapedAt": "2026-06-13T14:12:00.000Z"
}
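
Rows shaped like the example above can be sorted into a ready pile and a review pile using `status` and `quality.isSparse`. The sketch below is a minimal illustration. The `needs_review` and `split_rows` helpers are hypothetical, and treating either signal as a reason to review is an assumption about how you might want to triage output.

```python
def needs_review(row: dict) -> bool:
    """Flag rows marked partial or sparse so a person can check them."""
    quality = row.get("quality") or {}
    return row.get("status") == "partial" or bool(quality.get("isSparse"))


def split_rows(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (ready, review) lists, preserving the original row order."""
    ready, review = [], []
    for row in rows:
        (review if needs_review(row) else ready).append(row)
    return ready, review
```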

💸 Pricing

This Actor uses pay-per-event pricing. You are charged for each successfully extracted webpage. Failed, invalid, unavailable, empty, or non-HTML URLs are skipped and are not saved as output items.

Current event prices are:

Tier        Price per extracted webpage
FREE        $0.00090
BRONZE      $0.00090
SILVER      $0.00070
GOLD        $0.00050
PLATINUM    $0.00035
DIAMOND     $0.00025

This Actor's pricing has no separate Actor-start charge.
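
Because the only charge is per extracted webpage, a run's event cost is the number of extracted pages multiplied by the tier price. The sketch below applies that arithmetic to the prices listed above. The `estimate_cost` helper is a hypothetical name, and the result covers only the per-page event, assuming no other platform charges apply.

```python
from decimal import Decimal

# Prices per extracted webpage, copied from the tier list above (USD).
PRICE_PER_PAGE = {
    "FREE": Decimal("0.00090"),
    "BRONZE": Decimal("0.00090"),
    "SILVER": Decimal("0.00070"),
    "GOLD": Decimal("0.00050"),
    "PLATINUM": Decimal("0.00035"),
    "DIAMOND": Decimal("0.00025"),
}


def estimate_cost(extracted_pages: int, tier: str) -> Decimal:
    """Return the per-page event cost in USD for a number of extracted pages."""
    if extracted_pages < 0:
        raise ValueError("extracted_pages must not be negative")
    return PRICE_PER_PAGE[tier.upper()] * extracted_pages
```

For example, 1,000 extracted pages on the GOLD tier come to $0.50, and the same batch on the FREE tier comes to $0.90. Skipped URLs are not counted because they are not charged.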

⚠️ Limits and caveats

Webpage Text Extractor works best on public HTML pages with readable content in the initial page response. Pages that require login, block access, return non-HTML files, or rely heavily on client-side rendering may produce no output item or a partial row.

The Actor does not crawl a whole website from one URL. It extracts the submitted URLs only. If you need a link map first, use a crawler to collect URLs, then pass selected pages to this Actor.
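
When a crawler produces more URLs than one run accepts, the list can be split into batches before it is passed to this Actor. The sketch below removes duplicates and groups URLs into batches of at most 100, the public maximum mentioned for Maximum pages. The `batch_urls` helper is a hypothetical name, and running one batch per Actor run is an assumption about how you might organize the work.

```python
def batch_urls(urls: list[str], batch_size: int = 100) -> list[list[str]]:
    """Deduplicate URLs in order and split them into batches of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    unique = list(dict.fromkeys(urls))
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
```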

โ“ FAQ

🧾 Can I use this as a webpage-to-Markdown converter?

Yes. Choose Markdown for LLMs. The main text field will contain Markdown, and the markdown field will contain the same Markdown value for easy filtering.

🔗 Does it include links and headings?

Yes. Headings and visible links are included by default when the page provides them. You do not need to turn on separate metadata options.

🔒 Does it scrape private pages?

No. This Actor is for public web pages. It does not accept cookies, sessions, API keys, or login credentials.

⚠️ What happens when a URL fails?

The Actor logs the skipped URL and continues with the rest of the input. Only successfully extracted pages are saved to the dataset.
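
Since skipped URLs never reach the dataset, they can be found after a run by comparing the submitted list with the `requestedUrl` values in the output. The sketch below assumes that `requestedUrl` matches the submitted string exactly, as it does in the output example. The `find_skipped` helper is a hypothetical name.

```python
def find_skipped(submitted: list[str], rows: list[dict]) -> list[str]:
    """Return submitted URLs that have no matching row in the dataset."""
    extracted = {row.get("requestedUrl") for row in rows}
    return [url for url in submitted if url not in extracted]
```

The run log remains the place to look for the reason a given URL was skipped.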

📝 Changelog

  • 0.1: Initial release.

🆘 Support

For issues, questions, or feature requests, file a ticket and I'll fix or implement it in less than 24h 🫡

🔗 Other actors

Made with ❤️ by Maxime Dupré
