Website Content Extractor for RAG: Markdown, HTML, Text

Pricing

from $0.001 / result

Turn docs sites, help centers, blogs, and websites into clean markdown, text, or HTML for RAG, AI knowledge bases, and internal search. Crawl from start URLs or sitemaps and keep the crawl in scope.

Rating

5.0

(2)

Developer

nezha

Maintained by Community

Last modified: a month ago

What this Actor does

Most teams do not need "a crawler." They need a faster way to turn a website into usable content for:

embeddings and chunking pipelines
internal search and AI assistants
help center or docs ingestion
markdown, text, or HTML exports that do not require manual copy-paste

This Actor takes you from a website URL or sitemap to a structured content dataset with cleaned page text, markdown, HTML, headings, and crawl metadata, plus optional clean HTML records in the key-value store.

Quick start

1. Paste the URL of a docs site, help center, blog, or website section into the "Website, Docs, or Help Center URLs" field.
2. Keep maxPages: 3, crawlMode: auto, and outputFormat: markdown for the first run.
3. Click Run.
4. Check the dataset and OUTPUT_SUMMARY, then raise maxPages for a larger crawl.

In auto mode, the Actor tries sitemap discovery first, because it is usually faster for docs sites and help centers. If no crawlable sitemap pages are found, it falls back to following links from the start URLs you pasted.
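The sitemap-first behavior with a link-following fallback can be pictured roughly as below. This is only a sketch of the decision described above, not the Actor's actual code: `discover_urls` and the `sitemap_lookup` callback are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable


def discover_urls(
    start_urls: list[str],
    sitemap_lookup: Callable[[str], Iterable[str]],
) -> tuple[str, list[str]]:
    """Return (mode_used, seed_urls) following a sitemap-first strategy.

    sitemap_lookup is a hypothetical callback that returns the crawlable
    sitemap pages found for one start URL (an empty iterable if none).
    """
    sitemap_pages: list[str] = []
    for start in start_urls:
        sitemap_pages.extend(sitemap_lookup(start))
    if sitemap_pages:
        # Sitemap discovery succeeded: crawl the listed pages directly.
        return "sitemap", sitemap_pages
    # No crawlable sitemap pages: fall back to following links from the start URLs.
    return "website", list(start_urls)
```

Passing the lookup in as a callback keeps the decision logic separate from any network access, so you can reason about the fallback without fetching anything.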

Use cases

Docs site to RAG
Crawl developer docs, product docs, or API docs, then export markdown or clean HTML ready for chunking, embeddings, and retrieval.

Help center to AI support
Extract support articles as clean text or markdown for internal search, support copilots, and FAQ assistants.

Website to knowledge base
Capture blog posts, product pages, and guide content as structured text with titles, headings, canonical URLs, and crawl metadata.

Output preview

Here is a simplified preview of the extracted dataset:

| URL | Title | Format | Words | Language | Depth |
| --- | --- | --- | --- | --- | --- |
| `/academy/web-scraping-for-beginners` | Web scraping for beginners | markdown | 1842 | en | 1 |
| `/academy/api-integration-guide` | API integration guide | markdown | 1267 | en | 1 |
| `/academy/rag-pipeline-basics` | RAG pipeline basics | markdown | 2135 | en | 1 |

The same record can also include:

| Extra field group | Example value |
| --- | --- |
| Content outputs | `content`, `markdown`, `text`, `html` |
| Structure signals | `title`, `description`, `headings`, `canonicalUrl` |
| Crawl metadata | `depth`, `httpStatusCode`, `language`, `wordCount`, `crawledAt` |
| Clean HTML storage | `CLEAN_HTML_INDEX` plus separate clean HTML records |
| Run diagnostics | `OUTPUT_SUMMARY`, `FAILED_PAGES`, `SKIPPED_PAGES` |

Typical fields include:

page identity: url, title, description, canonicalUrl
main content outputs: content, markdown, text, html, cleanHtml
page structure: headings
crawl metadata: contentFormat, wordCount, language, depth, httpStatusCode, crawledAt
run-level outputs: OUTPUT_SUMMARY, FAILED_PAGES, SKIPPED_PAGES, CLEAN_HTML_INDEX
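If you consume the dataset from Python, a typed view of one page record may help. The field names below come from the lists above; the value types, the `PageRecord` and `REQUIRED_FOR_RAG` names, and the `missing_fields` helper are assumptions made for illustration, so check them against a real record from your own run.

```python
from typing import TypedDict


class PageRecord(TypedDict, total=False):
    """Assumed shape of one dataset item; the types are guesses, not a spec."""

    # Page identity
    url: str
    title: str
    description: str
    canonicalUrl: str
    # Main content outputs
    content: str
    markdown: str
    text: str
    html: str
    cleanHtml: str
    # Page structure (element type is not documented, so left generic)
    headings: list
    # Crawl metadata
    contentFormat: str
    wordCount: int
    language: str
    depth: int
    httpStatusCode: int
    crawledAt: str


# Hypothetical minimum set of fields a RAG ingestion step might require.
REQUIRED_FOR_RAG = ("url", "title", "markdown")


def missing_fields(record: dict, required: tuple = REQUIRED_FOR_RAG) -> list[str]:
    """List required fields that are absent or empty in a record."""
    return [name for name in required if not record.get(name)]
```

Running `missing_fields` over the first few items of a small run is a cheap way to confirm that the output format you selected actually populates the fields your pipeline reads.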

Examples

Option 1: Fast preview for RAG content

Best for a first run. It keeps cost low and returns enough pages to validate selectors, scope, and output quality.

{
  "startUrls": [
    {
      "url": "https://docs.apify.com/academy"
    }
  ],
  "maxPages": 3,
  "crawlMode": "auto",
  "outputFormat": "markdown",
  "maxDepth": 1,
  "sameDomainOnly": true,
  "saveCleanHtml": false
}

Option 2: Crawl directly from website pages

Best when you want to start from one section and follow links recursively.

{
  "startUrls": [
    {
      "url": "https://docs.apify.com/academy"
    }
  ],
  "maxPages": 20,
  "crawlMode": "website",
  "outputFormat": "markdown",
  "maxDepth": 2,
  "sameDomainOnly": true,
  "saveCleanHtml": true
}
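To see how maxPages, maxDepth, and sameDomainOnly interact when links are followed recursively, here is a small breadth-first sketch over an in-memory link graph. It is an illustration under stated assumptions, not the Actor's implementation: `plan_crawl` and the `links` mapping are hypothetical, and the start URL is assumed to count as depth 0.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_url, links, max_pages, max_depth, same_domain_only=True):
    """Breadth-first walk over a link graph, returning (url, depth) pairs.

    links maps each URL to the URLs it links to (a stand-in for real fetching).
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            # Pages at the depth limit are kept, but their links are not followed.
            continue
        for target in links.get(url, []):
            if target in seen:
                continue
            if same_domain_only and urlparse(target).netloc != start_host:
                continue  # out of scope: different host
            seen.add(target)
            queue.append((target, depth + 1))
    return visited
```

Breadth-first order means a low maxPages value is spent on pages closest to the start URL first, which is usually what you want when validating a section before a larger crawl.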

Option 3: Crawl from sitemap URLs

Best when the target site already has a sitemap and you want broader coverage with cleaner URL discovery.

{
  "startUrls": [
    {
      "url": "https://docs.apify.com/academy"
    }
  ],
  "maxPages": 50,
  "crawlMode": "sitemap",
  "sitemapUrls": [
    "https://docs.apify.com/sitemap.xml"
  ],
  "maxDepth": 0,
  "outputFormat": "markdown",
  "sameDomainOnly": true,
  "saveCleanHtml": true
}
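For intuition about what sitemap-based discovery involves in general, the sketch below reads `<loc>` entries from a standard sitemap document and keeps only those under a URL prefix, up to a page limit. How the Actor itself scopes sitemap URLs is not documented here, so the prefix filter and the `urls_from_sitemap` name are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text, scope_prefix, max_pages):
    """Return up to max_pages sitemap URLs that start with scope_prefix."""
    if max_pages <= 0:
        return []
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url.startswith(scope_prefix):
            urls.append(url)
            if len(urls) >= max_pages:
                break
    return urls
```

This handles a plain `urlset` only. Sitemap index files, which point to other sitemaps, would need one more level of fetching and parsing.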

Best practices

1. Documentation into a vector database

Use the Actor to crawl product docs or API docs, then send markdown or clean HTML into your chunking and embedding pipeline.

This is useful for:

RAG systems
developer assistants
internal technical search
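As one possible next step after the crawl, the markdown field can be split at headings and capped by word count before embedding. This is a minimal sketch, assuming ATX-style `#` headings in the markdown output; `chunk_markdown` and its `max_words` default are hypothetical and not part of the Actor.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown, max_words=200):
    """Split markdown into sections at headings, then cap each chunk at max_words."""
    sections = []
    heading, lines = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            if lines:
                sections.append((heading, lines))
            heading, lines = match.group(2).strip(), []
        elif line.strip():
            lines.append(line.strip())
    if lines:
        sections.append((heading, lines))

    chunks = []
    for heading, lines in sections:
        words = " ".join(lines).split()
        for start in range(0, len(words), max_words):
            chunks.append(
                {"heading": heading, "text": " ".join(words[start:start + max_words])}
            )
    return chunks
```

Keeping the heading on every chunk gives the retriever some context even when a long section is split. Note that this simple splitter treats `#` comment lines inside code blocks as headings, so code-heavy docs may need a fence-aware version.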

2. Help center into an AI support knowledge base

Use the Actor to crawl support articles and export them as markdown or text.

This is useful for:

support copilots
FAQ assistants
internal support search

3. Website content into an internal knowledge base

Use the Actor to capture blog posts, guides, and product pages in a consistent format.

This is useful for:

AI knowledge bases
content migration
website analysis and archiving

Why the dataset feels complete

This Actor does more than return a list of URLs.

You get the main content in markdown, text, and HTML.
You get structure signals such as titles, headings, descriptions, and canonical URLs.
You get crawl metadata such as word count, depth, language, status code, and crawl time.
You can store clean HTML separately for downstream parsing or chunking.
You also get run diagnostics for failed pages, skipped pages, and summary totals.

That combination makes the output useful not just for scraping, but for ingestion, QA, chunking, embeddings, search, and AI application pipelines.
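For the QA step, the crawl metadata can drive a simple gate before ingestion. The sketch below assumes each record carries `httpStatusCode` and `wordCount` as listed earlier; the `split_for_ingestion` name and the 50-word threshold are arbitrary choices for illustration.

```python
def split_for_ingestion(records, min_words=50):
    """Separate records that look safe to ingest from those worth a manual look."""
    keep, review = [], []
    for record in records:
        ok_status = record.get("httpStatusCode") == 200
        enough_text = record.get("wordCount", 0) >= min_words
        if ok_status and enough_text:
            keep.append(record)
        else:
            review.append(record)
    return keep, review
```

Thin or non-200 pages are set aside rather than dropped, so you can compare them with the failed and skipped page diagnostics before deciding whether to adjust the crawl scope.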

API access

Developers can run this Actor programmatically through the Apify API or the Apify Python and JavaScript clients.

API reference: Apify API
Client docs: Apify clients
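A minimal sketch of a programmatic run with the Apify Python client (`apify-client` package) might look like this. The Actor ID is an assumption taken from this listing's URL, the input mirrors Option 1 above, and `build_run_input` and `run_and_fetch` are hypothetical helper names.

```python
ACTOR_ID = "nezha/website-content-crawler"  # assumption: taken from the listing URL


def build_run_input(start_url: str, max_pages: int = 3) -> dict:
    """Build the first-run input from Option 1 for a given start URL."""
    return {
        "startUrls": [{"url": start_url}],
        "maxPages": max_pages,
        "crawlMode": "auto",
        "outputFormat": "markdown",
        "maxDepth": 1,
        "sameDomainOnly": True,
        "saveCleanHtml": False,
    }


def run_and_fetch(token: str, run_input: dict) -> list[dict]:
    """Run the Actor and return its dataset items (needs network and a valid token)."""
    from apify_client import ApifyClient  # third-party: pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not start")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

With an API token in hand, `run_and_fetch(token, build_run_input("https://docs.apify.com/academy"))` would start a small run and return its page records. Start with a low maxPages value, since pricing is per result.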

You might also like

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Logiover

Website Content Scraper

qaseemiqbal/website-content-scraper

Extract clean Markdown, plain text, linked files, and RAG-ready chunks from websites, documentation, help centers, knowledge bases, and authenticated portals. Preserve structure, metadata, URLs, and crawl context for AI search, training, and retrieval workflows.

Muhammad Qaseem Iqbal

Website Content Extractor

taroyamada/website-content-extractor

Extract clean text and markdown from docs, pricing, product, policy, and help-center URLs for RAG datasets and content operations.

naoki anzai

Docs Markdown Rag Ready Crawler

devwithbobby/docs-markdown-rag-ready-crawler

Turn any documentation site or website into clean, structured markdown, ready for RAG, embeddings, and AI agents.

Dev with Bobby

AI Web Content Crawler - Markdown for LLMs

intelscrape/ai-web-content-crawler

Crawl any website and extract clean Markdown optimized for LLM training, RAG pipelines, and AI knowledge bases - removes boilerplate and outputs structured JSON with URL, title, markdown, and metadata.

IntelScrape

Docs Change Monitor for AI

careybrown/docs-change-rag-ready-monitor

Monitor public docs, changelogs, help centers, status pages, and pricing pages for changes, then output clean Markdown and RAG-ready chunks for AI knowledge bases.

Carey Brown

Web Page to Markdown Extractor

fetch_cat/web-page-to-markdown-extractor

Convert public URLs into clean Markdown, text, metadata, links, images, and optional HTML for AI agents, RAG, support, and automation workflows.

Hanna Nosova

Website Content Crawler

crawlerbros/website-content-crawler

Crawls websites and extracts clean text, markdown, or HTML content. Ideal for LLM training data, RAG pipelines, and knowledge base building.

Crawler Bros

Website Content Crawler

parseforge/website-content-crawler

Crawl any website and pull clean Markdown content ready for AI! Follow links across a whole domain and extract page text, titles, headings, images, and metadata. Perfect for building RAG pipelines, training datasets, knowledge bases, and vector databases. Start crawling content in minutes!

ParseForge

Website Content Crawler API - Markdown for RAG

tugelbay/website-content-crawler

Crawl public websites and extract clean Markdown, text, or HTML for RAG pipelines, AI agents, documentation indexing, and content monitoring. Guide: https://konabayev.com/tools/website-content-crawler/?utm_source=apify_info&utm_medium=referral&utm_campaign=website-content-crawler

Tugelbay Konabayev

URL: https://apify.com/nezha/website-content-crawler
