Web Scraper RAG Ready

Deprecated

Pricing

from $2.00 / 1,000 pages scraped

Turn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines. Removes boilerplate, handles JavaScript rendering, and outputs structured JSON for LangChain, LlamaIndex, and vector databases.

Developer

Alexy Traore

Maintained by Community

Last modified: 4 months ago

RAG Web Scraper: The Ultimate HTML-to-Markdown Converter for LLMs

Turn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines.

Most web scrapers return raw HTML soup or noisy text, which LLMs don't need. This Actor works as a specialized filter: it extracts only the meaningful content, removes boilerplate, and outputs LLM-ready Markdown plus structured JSON that you can plug directly into your AI workflows (LangChain, LlamaIndex, Pinecone, etc.).

⚡ Key Features

Feature	Description
🧼 Clean Markdown	Removes navs, footers, ads, and cookie banners automatically.
🧠 RAG Chunking	Splits content into token-sized chunks (default: 600) for Vector DBs.
🐢/⚡ Hybrid Mode	Starts fast (Cheerio). Auto-switches to Playwright if it detects a React/Next.js SPA.
💡 Q&A Optimized	Preserves context on StackOverflow/Discourse style pages (Question + Answer).
💰 Efficient Cost	The hybrid engine keeps compute units low. You pay the usage fee only for successfully scraped pages.
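
To make the RAG Chunking row concrete, here is a minimal sketch of greedy, paragraph-based chunking. It is not the Actor's actual implementation: the function name `chunk_markdown` is hypothetical, and tokens are approximated by whitespace-separated words, so a real tokenizer would report different counts.

```python
def chunk_markdown(markdown: str, chunk_size: int = 600) -> list[dict]:
    """Greedily pack paragraphs into chunks that target chunk_size tokens.

    Assumption: a "token" is a whitespace-separated word. A paragraph that
    is longer than chunk_size on its own is kept whole as an oversized chunk.
    """
    chunks: list[dict] = []
    current: list[str] = []
    current_tokens = 0

    for paragraph in markdown.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        n_tokens = len(paragraph.split())
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_tokens + n_tokens > chunk_size:
            chunks.append({"content": "\n\n".join(current), "tokens": current_tokens})
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n_tokens

    if current:
        chunks.append({"content": "\n\n".join(current), "tokens": current_tokens})
    return chunks
```

Splitting on paragraph boundaries keeps headings and their text together more often than a fixed character window would, at the cost of uneven chunk sizes.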

🏆 Why RAG Web Scraper?

Feature	RAG Web Scraper	Standard Scraper	Full Browser Scraper
Cost	💰 Low (Hybrid)	💰 Low	💸 High
JS Support	✅ Auto-detect	❌ No	✅ Yes
Output	🧼 Clean Markdown	🗑️ Raw HTML	📄 Text/HTML
RAG Ready	✅ Chunked JSON	❌ No	❌ No

📉 The "Before & After" Test

Don't feed garbage to your AI. See the difference:

🔴 Standard Crawl (Raw HTML)

Contains ~50% noise: menus, scripts, footers.

<nav>Home > Docs > API</nav>
<div class="cookie-banner">We use cookies! [Accept]</div>
<main>
  <h1>React Hooks Guide</h1>
  <div class="sidebar">Join our Discord!</div>
  <p>Hooks are a new addition in React 16.8.</p>
  <div class="ad-container">BUY COFFEE NOW</div>
</main>
<footer>© 2026 Meta Platforms, Inc.</footer>

Result: High token costs, potential hallucinations.

🟢 RAG Web Scraper (Markdown)

Contains 100% signal.

# Getting Started with Crawlee
## Installation
Install Crawlee using npm:
npm install crawlee
## Basic Usage
Create a simple crawler in just a few lines of code...

Result: Cheap embedding, accurate answers.

💰 Pricing

$2.00 usage fee per 1,000 pages

We use a smart hybrid engine (Cheerio first) to keep compute costs aggressively low.

Efficiency First: We attempt fast static extraction first.
Power When Needed: We only launch a full browser (Playwright) if absolutely necessary.
Fair Usage: You pay a small usage fee + standard compute units.

Why this model? It ensures you get the lowest possible price for simple sites, while guaranteeing capability for complex SPAs.
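
As a quick worked example of the usage fee, the sketch below multiplies the page count by the listed rate. It covers only the $2.00 per 1,000 pages fee; compute units are billed separately and are not estimated here. The function name is hypothetical.

```python
USAGE_FEE_PER_1000_PAGES = 2.00  # USD, taken from the pricing stated above


def estimate_usage_fee(pages_scraped: int) -> float:
    """Return the usage fee in USD for a number of successfully scraped pages.

    Compute-unit costs are NOT included; they depend on the run itself.
    """
    if pages_scraped < 0:
        raise ValueError("pages_scraped must be non-negative")
    return round(pages_scraped * USAGE_FEE_PER_1000_PAGES / 1000, 4)
```

For instance, a 100-page run would carry a usage fee of about $0.20 before compute units.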

🚀 Usage

1. Simple Run

Perfect for testing or small docs.

{
  "startUrl": "https://docs.python.org/3/",
  "maxPages": 20
}

2. Advanced Run (RAG Pipeline)

Optimized for Vector Databases.

{
  "startUrl": "https://react.dev",
  "maxPages": 100,
  "includePaths": ["/learn/*"],
  "excludePaths": ["/community/*"],
  "chunkSize": 500,
  "outputFormat": "json",
  "enableChunking": true
}
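
The `includePaths` and `excludePaths` patterns in the input above can be read as filters on the URL path. The sketch below shows one plausible interpretation using glob-style matching from Python's standard library. The exact matching rules the Actor applies are not documented here, so treat `should_crawl` and its semantics as assumptions.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse


def should_crawl(url: str, include_paths: list[str], exclude_paths: list[str]) -> bool:
    """Decide whether a URL passes the include/exclude path filters.

    Assumed semantics: exclusions win over inclusions, and an empty
    include list means "everything is included".
    """
    path = urlparse(url).path or "/"
    if any(fnmatch(path, pattern) for pattern in exclude_paths):
        return False
    if include_paths:
        return any(fnmatch(path, pattern) for pattern in include_paths)
    return True
```

Under this reading, `/learn/*` matches `/learn/state` but not the bare `/learn` page, so a second pattern would be needed to cover both.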

🧠 Apify Run Options (Memory)

If you plan to scrape more than 20 pages in a single run, it's recommended to increase memory in the Apify Run options (e.g., 2–4 GB) to avoid timeouts and ensure stable crawling.

⚙️ Configuration

Option	Type	Default	Description
`startUrl`	String	(Required)	The URL to start crawling from.
`maxPages`	Integer	20	Maximum number of pages to crawl.
`maxDepth`	Integer	2	How deep to follow links (0 = start page only).
`outputFormat`	String	`json`	`json`: Structured RAG chunks + metadata. `markdown`: Plain .md files. `both`: Returns both formats.
`chunkSize`	Integer	600	Target size for chunks in tokens. Ideal for embeddings.
`includePaths`	Array	`[]`	Only crawl URLs matching these patterns (e.g. `/docs/*`).
`excludePaths`	Array	`[]`	Skip URLs matching these patterns.
`enableChunking`	Boolean	`true`	Enable smart chunking. Disable for full-page markdown only.
`stripReferences`	Boolean	`true`	Removes academic references/bibliography sections.
`usePlaywright`	Boolean	`false`	Force browser rendering (auto-detected by default).
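
When building the input programmatically, it can help to start from the defaults in the table above and override only what you need. This is a minimal sketch with a hypothetical helper, `build_run_input`; the validation it performs is an illustration, not something the Actor is known to enforce.

```python
from copy import deepcopy

# Defaults copied from the configuration table above.
DEFAULTS = {
    "maxPages": 20,
    "maxDepth": 2,
    "outputFormat": "json",
    "chunkSize": 600,
    "includePaths": [],
    "excludePaths": [],
    "enableChunking": True,
    "stripReferences": True,
    "usePlaywright": False,
}
OUTPUT_FORMATS = {"json", "markdown", "both"}


def build_run_input(start_url: str, **overrides) -> dict:
    """Merge overrides into the documented defaults and reject obvious mistakes."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise ValueError(f"Unknown option(s): {', '.join(unknown)}")
    # deepcopy so callers never share the default list objects.
    run_input = {"startUrl": start_url, **deepcopy(DEFAULTS), **overrides}
    if run_input["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(OUTPUT_FORMATS)}")
    return run_input
```

Calling `build_run_input("https://react.dev", maxPages=100, chunkSize=500)` yields a complete input object that can be serialized to JSON.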

🛠️ Technical Details

Smart Hybrid Crawling

We don't waste resources. The scraper starts in Fast Mode (Cheerio). If it detects a Single Page Application (React, Vue, Next.js), it automatically upgrades to Browser Mode (Playwright) to render the content correctly. You get the best of both worlds: speed when possible, power when needed.
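
How the Actor detects a Single Page Application is not documented here. As an illustration only, the sketch below uses a common heuristic: the statically fetched HTML carries a framework marker but almost no visible text. The marker list, the threshold, and the name `looks_like_spa` are all assumptions.

```python
import re

# Hypothetical markers that often appear in React/Next.js/Vue app shells.
SPA_MARKERS = ('id="root"', 'id="app"', 'id="__next"', "data-reactroot", "__NEXT_DATA__")


def looks_like_spa(html: str, min_text_chars: int = 200) -> bool:
    """Guess whether a static fetch returned an empty app shell.

    Returns True only when a framework marker is present AND the page has
    fewer than min_text_chars characters of visible text.
    """
    # Drop script/style blocks, then all remaining tags, to approximate visible text.
    without_code = re.sub(r"<(script|style)\b.*?</\1>", " ", html, flags=re.S | re.I)
    visible_text = re.sub(r"<[^>]+>", " ", without_code)
    visible_chars = len(" ".join(visible_text.split()))
    has_marker = any(marker in html for marker in SPA_MARKERS)
    return has_marker and visible_chars < min_text_chars
```

Requiring both conditions avoids launching a browser for server-rendered Next.js pages, which contain the markers but already ship their content in the HTML.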

Q&A Intelligence

Most scrapers flatten forums into a wall of text. We detect Q&A structures (StackOverflow, Discourse) and preserve the relationship between the Question and the Accepted Answer, ensuring your RAG system understands the context.

Noise Removal

We aggressively strip:

Navigation bars & Mega-menus
Footers & Legal disclaimers
Cookie consent banners & Popups
"Related Posts" widgets
Academic References/Bibliographies
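
The list above can be approximated with a small tag-and-class filter. The following is a simplified sketch built on Python's standard `html.parser`, not the Actor's real extraction logic: the tag set and class-name prefixes are assumptions, it expects reasonably well-formed HTML, and it emits plain text rather than Markdown.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}
NOISE_CLASS_PREFIXES = ("cookie", "sidebar", "ad-", "related")  # assumed naming habits
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "source", "track", "wbr"}


class NoiseStripper(HTMLParser):
    """Collect text that is not nested inside a noisy element."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a noisy element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:  # void tags never get a closing tag
            return
        classes = (dict(attrs).get("class") or "").split()
        noisy = tag in NOISE_TAGS or any(
            name.startswith(prefix) for name in classes for prefix in NOISE_CLASS_PREFIXES
        )
        if self.skip_depth or noisy:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_noise(html: str) -> str:
    """Return the remaining text, one line per text node."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Run on the raw HTML sample from the "Before & After" test, this keeps only the heading and the paragraph inside `<main>`.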

📤 Output Formats

JSON (Recommended for RAG)

Returns an array of objects with metadata and chunks.

{
  "url": "https://example.com",
  "title": "Page Title",
  "markdown": "# Page Title\n\nContent...",
  "chunks": [
    {"content": "Chunk 1...", "tokens": 450},
    {"content": "Chunk 2...", "tokens": 300}
  ]
}
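
Before embedding, the per-page objects usually need to be flattened into one record per chunk. The sketch below does that for items shaped like the example above. The record layout and the `url#chunk-N` ID scheme are assumptions made for illustration, so adapt them to whatever your vector database expects.

```python
def to_vector_records(items: list[dict]) -> list[dict]:
    """Flatten page-level items into one record per chunk.

    Each record carries the chunk text plus enough metadata to trace it
    back to its source page.
    """
    records = []
    for item in items:
        for index, chunk in enumerate(item.get("chunks", [])):
            records.append(
                {
                    "id": f"{item['url']}#chunk-{index}",  # hypothetical ID scheme
                    "text": chunk["content"],
                    "metadata": {
                        "url": item["url"],
                        "title": item.get("title", ""),
                        "tokens": chunk.get("tokens"),
                    },
                }
            )
    return records
```

Pages without a `chunks` array (for example, when chunking is disabled) simply produce no records.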

Markdown Files (outputFormat: 'markdown' or 'both')

When outputFormat is set to markdown or both, the full Markdown files are stored in the Apify Key-Value Store.

Note: The default Apify dataset only contains the JSON results. To get the actual .md files:

Go to the Key-Value Store tab in your Apify run.

Look for keys like OUTPUT or page-specific keys.

If using the API, target the Key-Value Store endpoint to download these raw files directly.
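
For the API route, a record can be fetched from the Apify Key-Value Store records endpoint. The sketch below uses only the Python standard library. The store ID and record key are placeholders you must supply from your own run, and no key names beyond those mentioned above are assumed.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

API_BASE = "https://api.apify.com/v2"


def record_url(store_id: str, key: str) -> str:
    """Build the URL of a single Key-Value Store record."""
    return f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}/records/{quote(key, safe='')}"


def download_record(store_id: str, key: str, token: str) -> bytes:
    """Download one record's raw bytes, authenticating with an API token."""
    request = Request(
        record_url(store_id, key),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(request) as response:
        return response.read()
```

Decode the returned bytes as UTF-8 to get the Markdown text, assuming the record was stored as text.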

Markdown

Returns a single Markdown file per page (or combined), perfect for archiving or direct LLM context.

🙋 FAQ

Q: Does it work on sites behind login?
A: No. It is currently designed for public documentation and content sites.

Q: How do you count pages?
A: Only successfully scraped pages count. If a page fails or is skipped, you aren't charged for it.

Q: Can I use this with LangChain?
A: Yes! The JSON output is designed to be loaded directly with LangChain's ApifyDatasetLoader.
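
A minimal sketch of that LangChain integration follows. It assumes the `langchain-community` and `apify-client` packages are installed, that the `APIFY_API_TOKEN` environment variable is set, and that dataset items have the `url`, `title`, and `markdown` fields shown in the JSON output example. The helper names are hypothetical.

```python
def item_to_page(item: dict) -> tuple[str, dict]:
    """Split one dataset item into (page_content, metadata)."""
    metadata = {"source": item.get("url", ""), "title": item.get("title", "")}
    return item.get("markdown", ""), metadata


def load_documents(dataset_id: str):
    """Load a run's dataset as LangChain Documents, one per scraped page."""
    # Imported lazily so item_to_page stays usable without LangChain installed.
    from langchain_community.document_loaders import ApifyDatasetLoader
    from langchain_core.documents import Document

    def to_document(item: dict) -> Document:
        content, metadata = item_to_page(item)
        return Document(page_content=content, metadata=metadata)

    loader = ApifyDatasetLoader(
        dataset_id=dataset_id,
        dataset_mapping_function=to_document,
    )
    return loader.load()
```

This maps each page to a single Document. To index the pre-computed chunks instead, iterate over each item's `chunks` array after loading.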

URL: https://apify.com/traorealexy/web-sraper-rag-ready