URL: https://apify.com/scrapeai/ai-website-content-extractor

AI Website Content Extractor · Apify


Pricing

$5.00/month + usage

AI Website Content Extractor

Crawl website pages, strip noise, and convert the main content to clean Markdown for RAG/LLM training.

Rating: 5.0 (2)

Developer: ScrapeAI (Maintained by Community)

Actor stats: 0 bookmarked, 6 total users, 0 monthly active users

Last modified: 3 months ago

An Apify Actor that crawls one or more website pages using Playwright, removes navigation, ads, and other noise, then converts the main content to clean Markdown, ready for RAG pipelines, vector databases, and LLM training datasets.

Features

  • Crawl any public website page(s)
  • Automatically dismiss cookie / consent dialogs
  • Strip navigation bars, headers, footers, sidebars, ads, and modals
  • Detect the main content area using semantic HTML selectors (main, article, [role="main"], etc.)
  • Convert HTML to clean Markdown via turndown
  • Skip low-content pages (login walls, redirects) automatically
  • Output a structured dataset ready for AI use cases
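
The noise-stripping and main-content detection steps listed above can be illustrated with a minimal sketch. It is not the Actor's implementation (which, per the description, runs Playwright and turndown); it uses only Python's standard library, produces plain text rather than Markdown, and the tag lists in `NOISE_TAGS` and `MAIN_TAGS` are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed tag lists for this sketch; the Actor's real selectors may differ.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
MAIN_TAGS = {"main", "article"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class MainContentParser(HTMLParser):
    """Collects text inside main-content containers, skipping noise elements."""

    def __init__(self):
        super().__init__()
        self.main_depth = 0   # > 0 while inside <main>, <article>, or [role="main"]
        self.noise_depth = 0  # > 0 while inside a noise element
        self.stack = []       # (tag, kind) for every open non-void element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        if tag in NOISE_TAGS:
            kind = "noise"
            self.noise_depth += 1
        elif tag in MAIN_TAGS or dict(attrs).get("role") == "main":
            kind = "main"
            self.main_depth += 1
        else:
            kind = "other"
        self.stack.append((tag, kind))

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return  # stray closing tag, ignore it
        # Pop up to the matching open tag, which tolerates unclosed children.
        while self.stack:
            open_tag, kind = self.stack.pop()
            if kind == "noise":
                self.noise_depth -= 1
            elif kind == "main":
                self.main_depth -= 1
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.main_depth > 0 and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the text of the main content area, one text node per paragraph."""
    parser = MainContentParser()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.chunks)
```

Tracking two depth counters keeps the logic simple: text is kept only while the parser is inside a main-content container and outside every noise element, so an `<aside>` nested in `<main>` is still dropped.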

Input

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| startUrls | Array | List of {url} objects or plain URL strings to crawl | [{url: "https://example.com"}] |
| maxPages | Number | Maximum number of pages to process | 20 |
| proxyConfiguration | Object | Apify proxy settings (optional) | {} |
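
Because startUrls accepts both {url} objects and plain strings, a caller may want to normalize the input before submitting it. The following is a minimal sketch of that normalization using the defaults from the table; `normalize_input` is a hypothetical helper, and the validation rules (absolute http(s) URLs, maxPages of at least 1) are assumptions rather than documented Actor behavior.

```python
from urllib.parse import urlparse

# Defaults taken from the Input table.
DEFAULT_START_URLS = [{"url": "https://example.com"}]
DEFAULT_MAX_PAGES = 20


def normalize_input(actor_input: dict) -> dict:
    """Return the input with startUrls flattened to strings and defaults applied."""
    urls = []
    for entry in actor_input.get("startUrls") or DEFAULT_START_URLS:
        # Accept either {"url": "..."} objects or plain URL strings.
        url = entry.get("url") if isinstance(entry, dict) else entry
        if not isinstance(url, str):
            raise ValueError(f"Unsupported startUrls entry: {entry!r}")
        url = url.strip()
        parsed = urlparse(url)
        # Assumed rule: only absolute http(s) URLs are crawlable.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an absolute http(s) URL: {url!r}")
        urls.append(url)

    max_pages = int(actor_input.get("maxPages", DEFAULT_MAX_PAGES))
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")

    return {
        "startUrls": urls,
        "maxPages": max_pages,
        "proxyConfiguration": actor_input.get("proxyConfiguration") or {},
    }
```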

Example Input

{
  "startUrls": [
    {"url": "https://en.wikipedia.org/wiki/Artificial_intelligence"},
    {"url": "https://openai.com/blog"}
  ],
  "maxPages": 10
}
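
An input like the one above could be submitted through the Apify API. The sketch below only builds the HTTP request, using the standard library; it assumes the Actor ID is `scrapeai~ai-website-content-extractor` (inferred from the Store URL), uses the `run-sync-get-dataset-items` endpoint of the Apify API v2, and takes a placeholder token. Check the Actor's API tab for the exact values before relying on it.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
# Assumption: Actor ID inferred from the Store URL as username~actor-name.
ACTOR_ID = "scrapeai~ai-website-content-extractor"


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Sending the request (requires a valid token and network access):
#   with urllib.request.urlopen(build_run_request(token, actor_input)) as resp:
#       records = json.load(resp)
```

The synchronous endpoint waits for the run to finish, which suits small runs; for larger crawls, starting the run asynchronously and reading the dataset afterwards is likely the better fit.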

Output

Each extracted page produces one dataset record:

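| Field | Type | Description |
| --- | --- | --- |
| url | String | URL of the crawled page |
| title | String | Page <title> |
| markdown | String | Clean Markdown of the main content |
| text | String | Plain-text version of the main content |
| wordCount | Number | Approximate word count of the Markdown |
| extractedAt | String | ISO 8601 timestamp |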

Example Output

{
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "title": "Artificial intelligence - Wikipedia",
  "markdown": "# Artificial intelligence\n\nArtificial intelligence (AI) is the simulation of human intelligence...",
  "text": "Artificial intelligence\n\nArtificial intelligence (AI) is the simulation of human intelligence...",
  "wordCount": 4312,
  "extractedAt": "2026-03-13T08:00:00.000Z"
}
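
To make the record shape concrete, here is a minimal sketch of how such a record could be assembled, including the low-content skip mentioned under Features. `build_record` and the `MIN_WORDS` threshold are hypothetical; the Actor's actual cutoff and word-counting method are not documented here, so whitespace splitting stands in for the "approximate word count".

```python
from datetime import datetime, timezone
from typing import Optional

MIN_WORDS = 50  # hypothetical threshold for "low-content" pages


def build_record(url: str, title: str, markdown: str, text: str,
                 min_words: int = MIN_WORDS) -> Optional[dict]:
    """Return a dataset record, or None when the page has too little content."""
    # Approximate word count: whitespace-separated tokens of the Markdown.
    word_count = len(markdown.split())
    if word_count < min_words:
        return None  # e.g. login walls or redirect stubs

    # ISO 8601 UTC timestamp with millisecond precision and a "Z" suffix.
    extracted_at = (
        datetime.now(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    return {
        "url": url,
        "title": title,
        "markdown": markdown,
        "text": text,
        "wordCount": word_count,
        "extractedAt": extracted_at,
    }
```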

Use Cases

  • RAG pipelines: ingest Markdown directly into your vector store
  • LLM fine-tuning: build clean text corpora from any website
  • AI chatbots: feed domain knowledge to your assistant
  • Research: extract and archive article content at scale
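
For the RAG use case, the markdown field usually needs to be split into chunks before embedding. The sketch below shows one simple approach, splitting on Markdown headings; `chunk_markdown` is a hypothetical helper that is not part of the Actor, and it does not special-case `#` lines inside fenced code blocks.

```python
import re
from typing import List

# ATX headings: one to six "#" characters followed by whitespace and the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def _append_chunk(chunks: List[dict], heading: str, lines: List[str]) -> None:
    """Add a chunk for the collected lines unless the body is empty."""
    body = "\n".join(lines).strip()
    if body:
        chunks.append({"heading": heading, "text": body})


def chunk_markdown(markdown: str) -> List[dict]:
    """Split Markdown into chunks, one per heading section."""
    chunks: List[dict] = []
    heading = ""  # text before the first heading gets an empty heading
    lines: List[str] = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            _append_chunk(chunks, heading, lines)
            heading = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    _append_chunk(chunks, heading, lines)
    return chunks
```

Each chunk keeps its heading, which can be stored as metadata alongside the embedding so retrieved passages retain their context.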

Tips

  • Supply multiple startUrls to crawl several pages in one run
  • Increase maxPages to crawl an entire site (combine with Apify's link-following features)
  • For authenticated pages, configure a proxy or session in proxyConfiguration
