URL: https://apify.com/traorealexy/web-sraper-rag-ready

Web Scraper RAG Ready [DEPRECATED] · Apify


Web Scraper RAG Ready

Deprecated

Pricing

from $2.00 / 1,000 pages scraped

Turn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines. Removes boilerplate, handles JavaScript rendering, and outputs structured JSON for LangChain, LlamaIndex, and vector databases.

Rating: 0.0 (0)

Developer: Alexy Traore (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 5
  • Monthly active users: 1
  • Last modified: 4 months ago

RAG Web Scraper: The Ultimate HTML-to-Markdown Converter for LLMs

Turn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines.

Most web scrapers return raw HTML soup or noisy text, and LLMs don't need that. This project acts as a specialized filter that extracts only the meaningful content, removes boilerplate, and outputs LLM-ready Markdown plus structured JSON you can plug directly into your AI workflows (LangChain, LlamaIndex, Pinecone, etc.).


⚡ Key Features

| Feature | Description |
| --- | --- |
| 🧼 Clean Markdown | Removes navs, footers, ads, and cookie banners automatically. |
| 🧠 RAG Chunking | Splits content into token-sized chunks (default: 600) for Vector DBs. |
| 🐢/⚡ Hybrid Mode | Starts fast (Cheerio). Auto-switches to Playwright if it detects a React/Next.js SPA. |
| 💡 Q&A Optimized | Preserves context on StackOverflow/Discourse style pages (Question + Answer). |
| 💰 Efficient Cost | Hybrid engine keeps compute units low. Pay usage fee only for results. |

🏆 Why RAG Web Scraper?

| Feature | RAG Web Scraper | Standard Scraper | Full Browser Scraper |
| --- | --- | --- | --- |
| Cost | 💰 Low (Hybrid) | 💰 Low | 💸 High |
| JS Support | ✅ Auto-detect | ❌ No | ✅ Yes |
| Output | 🧼 Clean Markdown | 🗑️ Raw HTML | 📄 Text/HTML |
| RAG Ready | ✅ Chunked JSON | ❌ No | ❌ No |

📉 The "Before & After" Test

Don't feed garbage to your AI. See the difference:

🔴 Standard Crawl (Raw HTML)

Contains ~50% noise: menus, scripts, footers.

<nav>Home > Docs > API</nav>
<div class="cookie-banner">We use cookies! [Accept]</div>
<main>
<h1>React Hooks Guide</h1>
<div class="sidebar">Join our Discord!</div>
<p>Hooks are a new addition in React 16.8.</p>
<div class="ad-container">BUY COFFEE NOW</div>
</main>
<footer>© 2026 Meta Platforms, Inc.</footer>

Result: High token costs, potential hallucinations.

🟢 RAG Web Scraper (Markdown)

Contains 100% signal.

# Getting Started with Crawlee
## Installation
Install Crawlee using npm:
npm install crawlee
## Basic Usage
Create a simple crawler in just a few lines of code...

Result: Cheap embedding, accurate answers.


💰 Pricing

$2.00 usage fee per 1,000 pages

We use a smart hybrid engine (Cheerio first) to keep compute costs aggressively low.

  • Efficiency First: We attempt fast static extraction first.
  • Power When Needed: We only launch a full browser (Playwright) if absolutely necessary.
  • Fair Usage: You pay a small usage fee + standard compute units.

Why this model? It ensures you get the lowest possible price for simple sites, while guaranteeing capability for complex SPAs.
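
As a rough worked example of the usage fee alone, the arithmetic can be sketched as below. The $2.00 per 1,000 pages rate comes from this page; the function name is hypothetical, and Apify compute units are billed separately and are not modeled here.

```python
USAGE_FEE_PER_1000_PAGES = 2.00  # USD, the rate quoted on this page


def estimate_usage_fee(successful_pages: int) -> float:
    """Estimate the usage fee in USD for successfully scraped pages.

    Compute-unit charges are NOT included (they depend on the run).
    """
    if successful_pages < 0:
        raise ValueError("successful_pages must be non-negative")
    return round(successful_pages * USAGE_FEE_PER_1000_PAGES / 1000, 4)


if __name__ == "__main__":
    for pages in (20, 100, 1000):
        print(f"{pages} pages -> ${estimate_usage_fee(pages):.2f} usage fee")
```

Under these assumptions, a 100-page crawl costs $0.20 in usage fees before compute units.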


🚀 Usage

1. Simple Run

Perfect for testing or small docs.

{
  "startUrl": "https://docs.python.org/3/",
  "maxPages": 20
}

2. Advanced Run (RAG Pipeline)

Optimized for Vector Databases.

{
  "startUrl": "https://react.dev",
  "maxPages": 100,
  "includePaths": ["/learn/*"],
  "excludePaths": ["/community/*"],
  "chunkSize": 500,
  "outputFormat": "json",
  "enableChunking": true
}
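
The exact matching rules behind includePaths and excludePaths are not documented on this page. The sketch below shows one plausible reading, assuming shell-style glob patterns tested against the URL path, with excludes taking priority. The function name `should_crawl` is hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def should_crawl(url: str, include_paths: list, exclude_paths: list) -> bool:
    """Decide whether a URL passes the include/exclude filters.

    Assumption: patterns are globs matched against the URL path only,
    and an exclude match always wins over an include match.
    """
    path = urlparse(url).path or "/"
    if any(fnmatchcase(path, pattern) for pattern in exclude_paths):
        return False
    if not include_paths:
        # An empty include list means "no restriction".
        return True
    return any(fnmatchcase(path, pattern) for pattern in include_paths)


if __name__ == "__main__":
    include, exclude = ["/learn/*"], ["/community/*"]
    for candidate in (
        "https://react.dev/learn/state",
        "https://react.dev/community/team",
        "https://react.dev/blog",
    ):
        print(candidate, should_crawl(candidate, include, exclude))
```

Note that under this reading `/learn/*` does not match the bare `/learn` path, so a pattern such as `/learn*` would be needed to include it.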

🧠 Apify Run Options (Memory)

If you plan to scrape more than 20 pages in a single run, increase the memory in the Apify Run options (e.g., 2 to 4 GB) to avoid timeouts and keep crawling stable.

⚙️ Configuration

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | String | (Required) | The URL to start crawling from. |
| maxPages | Integer | 20 | Maximum number of pages to crawl. |
| maxDepth | Integer | 2 | How deep to follow links (0 = start page only). |
| outputFormat | String | json | json: Structured RAG chunks + metadata; markdown: Plain .md files; both: Returns both formats. |
| chunkSize | Integer | 600 | Target size for chunks in tokens. Ideal for embeddings. |
| includePaths | Array | [] | Only crawl URLs matching these patterns (e.g. /docs/*). |
| excludePaths | Array | [] | Skip URLs matching these patterns. |
| enableChunking | Boolean | true | Enable smart chunking. Disable for full-page markdown only. |
| stripReferences | Boolean | true | Removes academic references/bibliography sections. |
| usePlaywright | Boolean | false | Force browser rendering (auto-detected by default). |
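
To make the chunkSize option concrete, here is a minimal sketch of token-sized chunking. It is not the actor's implementation: the tokenizer and splitting rules are not stated on this page, so the sketch assumes whitespace-separated words as a stand-in for tokens and packs blank-line-separated paragraphs greedily. The names `approx_tokens` and `chunk_markdown` are hypothetical.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate: count whitespace-separated words (assumption)."""
    return len(text.split())


def chunk_markdown(markdown: str, chunk_size: int = 600) -> list:
    """Greedily pack paragraphs into chunks of at most chunk_size tokens.

    A single paragraph longer than chunk_size is kept whole, so it
    becomes one oversized chunk rather than being cut mid-sentence.
    """
    chunks = []
    current, current_tokens = [], 0
    for paragraph in (p.strip() for p in markdown.split("\n\n")):
        if not paragraph:
            continue
        n = approx_tokens(paragraph)
        # Flush the current chunk if adding this paragraph would overflow it.
        if current and current_tokens + n > chunk_size:
            chunks.append({"content": "\n\n".join(current), "tokens": current_tokens})
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n
    if current:
        chunks.append({"content": "\n\n".join(current), "tokens": current_tokens})
    return chunks


if __name__ == "__main__":
    doc = "# Title\n\nFirst paragraph of text.\n\nSecond paragraph of text."
    for chunk in chunk_markdown(doc, chunk_size=6):
        print(chunk["tokens"], repr(chunk["content"]))
```

Splitting on paragraph boundaries keeps headings attached to the text that follows them whenever the budget allows, which tends to give more self-contained chunks than fixed-width slicing.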

🛠️ Technical Details

Smart Hybrid Crawling

We don't waste resources. The scraper starts in Fast Mode (Cheerio). If it detects a Single Page Application (React, Vue, Next.js), it automatically upgrades to Browser Mode (Playwright) to render the content correctly. You get the best of both worlds: speed when possible, power when needed.
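
How the scraper decides to upgrade is not described here. One hypothetical heuristic, sketched below, is to fall back to a browser when the statically fetched HTML contains a common SPA mount marker but almost no visible text. The marker list, the 200-character threshold, and the function names are all assumptions.

```python
import re

# Assumed markers that often appear in client-rendered app shells.
SPA_MARKERS = ('id="root"', 'id="__next"', 'id="app"', "__NEXT_DATA__", "data-reactroot")


def visible_text_length(html: str) -> int:
    """Length of the text left after dropping scripts, styles, and tags."""
    without_code = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", without_code)
    return len(" ".join(text.split()))


def needs_browser(html: str, min_text_chars: int = 200) -> bool:
    """Return True when the static HTML looks like an empty SPA shell."""
    has_marker = any(marker in html for marker in SPA_MARKERS)
    return has_marker and visible_text_length(html) < min_text_chars


if __name__ == "__main__":
    shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
    print(needs_browser(shell))
```

A server-rendered Next.js page carries the same markers but also plenty of text, so under this heuristic it would stay on the fast path.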

Q&A Intelligence

Most scrapers flatten forums into a wall of text. We detect Q&A structures (StackOverflow, Discourse) and preserve the relationship between the Question and the Accepted Answer, ensuring your RAG system understands the context.

Noise Removal

We aggressively strip:

  • Navigation bars & Mega-menus
  • Footers & Legal disclaimers
  • Cookie consent banners & Popups
  • "Related Posts" widgets
  • Academic References/Bibliographies
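
The stripping described above can be illustrated with a small standard-library sketch. It is not the actor's code: the tag set, the class-name prefixes, and the names `NoiseStripper` and `strip_noise` are assumptions, and it expects reasonably well-formed HTML.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}
# Assumed class-name prefixes that usually mark boilerplate widgets.
NOISE_CLASS_PREFIXES = ("cookie", "sidebar", "ad-", "related")
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NoiseStripper(HTMLParser):
    """Collect text that is not nested inside a noisy element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a noisy element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        classes = (dict(attrs).get("class") or "").split()
        noisy = tag in NOISE_TAGS or any(
            name.startswith(prefix) for name in classes for prefix in NOISE_CLASS_PREFIXES
        )
        if self.skip_depth or noisy:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_noise(html: str) -> str:
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


if __name__ == "__main__":
    sample = (
        '<nav>Home</nav><div class="cookie-banner">We use cookies!</div>'
        "<main><h1>React Hooks Guide</h1>"
        '<div class="sidebar">Join our Discord!</div>'
        "<p>Hooks are a new addition in React 16.8.</p></main>"
        "<footer>Footer text</footer>"
    )
    print(strip_noise(sample))
```

Tracking a depth counter rather than a single flag means text nested several levels inside a noisy container is skipped as well.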

📤 Output Formats

JSON (Recommended for RAG)

Returns an array of objects with metadata and chunks.

{
  "url": "https://example.com",
  "title": "Page Title",
  "markdown": "# Page Title\n\nContent...",
  "chunks": [
    {"content": "Chunk 1...", "tokens": 450},
    {"content": "Chunk 2...", "tokens": 300}
  ]
}
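
Before embedding, items shaped like the one above usually need to be flattened into one record per chunk. The sketch below assumes exactly the field names shown in the sample (`url`, `title`, `chunks`, `content`, `tokens`); the record layout and the name `to_embedding_records` are hypothetical and should be adapted to your vector database.

```python
import json


def to_embedding_records(items: list) -> list:
    """Flatten page items into one record per chunk, keeping page metadata."""
    records = []
    for item in items:
        for index, chunk in enumerate(item.get("chunks", [])):
            records.append({
                # Hypothetical ID scheme: page URL plus the chunk position.
                "id": f'{item["url"]}#chunk-{index}',
                "text": chunk["content"],
                "metadata": {
                    "url": item["url"],
                    "title": item.get("title", ""),
                    "tokens": chunk.get("tokens"),
                },
            })
    return records


if __name__ == "__main__":
    items = [{
        "url": "https://example.com",
        "title": "Page Title",
        "markdown": "# Page Title\n\nContent...",
        "chunks": [
            {"content": "Chunk 1...", "tokens": 450},
            {"content": "Chunk 2...", "tokens": 300},
        ],
    }]
    print(json.dumps(to_embedding_records(items), indent=2))
```

Carrying the URL and title on every record lets a retriever cite the source page for each chunk it returns.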

Markdown Files (outputFormat: 'markdown' or 'both')

When outputFormat is set to markdown or both, the full Markdown files are stored in the Apify Key-Value Store.

Note: The default Apify dataset only contains the JSON results. To get the actual .md files:

  1. Go to the Key-Value Store tab in your Apify run.
  2. Look for keys like OUTPUT or page-specific keys.
  3. If using the API, target the Key-Value Store endpoint to download these raw files directly.
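
For the API route in step 3, a minimal sketch using only the standard library might look like this. It assumes the Apify API v2 key-value store record endpoint and Bearer-token authentication; the store ID, record key, and token are placeholders you would take from your own run, and the function names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

API_BASE = "https://api.apify.com/v2"


def record_url(store_id: str, key: str) -> str:
    """Build the URL of a single record in a key-value store."""
    return (
        f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}"
        f"/records/{quote(key, safe='')}"
    )


def download_record(store_id: str, key: str, token: str) -> bytes:
    """Fetch the raw bytes of a record (for example, a Markdown file)."""
    request = Request(
        record_url(store_id, key),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(request) as response:
        return response.read()


if __name__ == "__main__":
    # Placeholder values: replace with the store ID and key from your run.
    print(record_url("YOUR_STORE_ID", "OUTPUT"))
```

Which keys hold the Markdown files depends on the run, so check the Key-Value Store tab for the actual key names before downloading.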

Markdown

Returns a single Markdown file per page (or combined), perfect for archiving or direct LLM context.


🙋 FAQ

Q: Does it work on sites behind login?
A: Currently designed for public documentation and content sites.

Q: How do you count pages?
A: Only successfully scraped pages count. If a page fails or is skipped, you aren't charged.

Q: Can I use this with LangChain?
A: Yes! The JSON output is designed to be directly loaded into LangChain's ApifyDatasetLoader.
