Pricing
from $2.00 / 1,000 page scrapeds
Web Scraper RAG Ready
DeprecatedTurn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines. Removes boilerplate, handles JavaScript rendering, and outputs structured JSON for LangChain, LlamaIndex, and vector databases.
Pricing
from $2.00 / 1,000 page scrapeds
Rating
0.0
(0)
Developer
Actor stats
0
Bookmarked
5
Total users
1
Monthly active users
4 months ago
Last modified
Categories
Share
RAG Web Scraper: The Ultimate HTML-to-Markdown Converter for LLMs
Turn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines.
๐ Clean Markdown vs Raw HTML
Most web scrapers return raw HTML soup or noisy text โ LLMs don't need that. This project acts as a specialized filter that extracts only the meaningful content, removes boilerplate, and outputs LLM-ready Markdown plus structured JSON you can plug directly into your AI workflows (LangChain, LlamaIndex, Pinecone, etc.).
โก Key Features
| Feature | Description |
|---|---|
| ๐งผ Clean Markdown | Removes navs, footers, ads, and cookie banners automatically. |
| ๐ง RAG Chunking | Splits content into token-sized chunks (default: 600) for Vector DBs. |
| ๐ข/โก Hybrid Mode | Starts fast (Cheerio). Auto-switches to Playwright if it detects a React/Next.js SPA. |
| ๐ก Q&A Optimized | Preserves context on StackOverflow/Discourse style pages (Question + Answer). |
| ๐ฐ Efficient Cost | Hybrid engine keeps compute units low. Pay usage fee only for results. |
๐ Why RAG Web Scraper?
| Feature | RAG Web Scraper | Standard Scraper | Full Browser Scraper |
|---|---|---|---|
| Cost | ๐ฐ Low (Hybrid) | ๐ฐ Low | ๐ธ High |
| JS Support | โ Auto-detect | โ No | โ Yes |
| Output | ๐งผ Clean Markdown | ๐๏ธ Raw HTML | ๐ Text/HTML |
| RAG Ready | โ Chunked JSON | โ No | โ No |
๐ The "Before & After" Test
Don't feed garbage to your AI. See the difference:
๐ด Standard Crawl (Raw HTML)
Contains ~50% noise: menus, scripts, footers.
<nav>Home > Docs > API</nav><divclass="cookie-banner">We use cookies! [Accept]</div><main><h1>React Hooks Guide</h1><divclass="sidebar">Join our Discord!</div><p>Hooks are a new addition in React 16.8.</p><divclass="ad-container">BUY COFFEE NOW</div></main><footer>ยฉ 2026 Meta Platforms, Inc.</footer>
Result: High token costs, potential hallucinations.
๐ข RAG Web Scraper (Markdown)
Contains 100% signal.
# Getting Started with Crawlee## InstallationInstall Crawlee using npm:npm install crawlee## Basic UsageCreate a simple crawler in just a few lines of code...
Result: Cheap embedding, accurate answers.
๐ฐ Pricing
$2.00 usage fee per 1,000 pages
We use a smart hybrid engine (Cheerio first) to keep compute costs aggressively low.
- Efficiency First: We attempt fast static extraction first.
- Power When Needed: We only launch a full browser (Playwright) if absolutely necessary.
- Fair Usage: You pay a small usage fee + standard compute units.
Why this model? It ensures you get the lowest possible price for simple sites, while guaranteeing capability for complex SPAs.
๐ Usage
1. Simple Run
Perfect for testing or small docs.
{"startUrl":"https://docs.python.org/3/","maxPages":20}
2. Advanced Run (RAG Pipeline)
Optimized for Vector Databases.
{"startUrl":"https://react.dev","maxPages":100,"includePaths":["/learn/*"],"excludePaths":["/community/*"],"chunkSize":500,"outputFormat":"json","enableChunking":true}
๐ง Apify Run Options (Memory)
If you plan to scrape more than 20 pages in a single run, it's recommended to increase memory in the Apify Run options (e.g., 2โ4 GB) to avoid timeouts and ensure stable crawling.
โ๏ธ Configuration
| Option | Type | Default | Description |
|---|---|---|---|
startUrl | String | (Required) | The URL to start crawling from. |
maxPages | Integer | 20 | Maximum number of pages to crawl. |
maxDepth | Integer | 2 | How deep to follow links (0 = start page only). |
outputFormat | String | json | json: Structured RAG chunks + metadata.markdown: Plain .md files.both: Returns both formats. |
chunkSize | Integer | 600 | Target size for chunks in tokens. Ideal for embeddings. |
includePaths | Array | [] | Only crawl URLs matching these patterns (e.g. /docs/*). |
excludePaths | Array | [] | Skip URLs matching these patterns. |
enableChunking | Boolean | true | Enable smart chunking. Disable for full-page markdown only. |
stripReferences | Boolean | true | Removes academic references/bibliography sections. |
usePlaywright | Boolean | false | Force browser rendering (auto-detected by default). |
๐ ๏ธ Technical Details
Smart Hybrid Crawling
We don't waste resources. The scraper starts in Fast Mode (Cheerio). If it detects a Single Page Application (React, Vue, Next.js), it automatically upgrades to Browser Mode (Playwright) to render the content correctly. You get the best of both worlds: speed when possible, power when needed.
Q&A Intelligence
Most scrapers flatten forums into a wall of text. We detect Q&A structures (StackOverflow, Discourse) and preserve the relationship between the Question and the Accepted Answer, ensuring your RAG system understands the context.
Noise Removal
We aggressively strip:
- Navigation bars & Mega-menus
- Footers & Legal disclaimers
- Cookie consent banners & Popups
- "Related Posts" widgets
- Academic References/Bibliographies
๐ค Output Formats
JSON (Recommended for RAG)
Returns an array of objects with metadata and chunks.
{"url":"https://example.com","title":"Page Title","markdown":"# Page Title\n\nContent...","chunks":[{"content":"Chunk 1...","tokens":450},{"content":"Chunk 2...","tokens":300}]}
Markdown Files (outputFormat: 'markdown' or 'both')
When outputFormat is set to markdown or both, the full Markdown files are stored in the Apify Key-Value Store.
Note: The default Apify dataset only contains the JSON results. To get the actual
.mdfiles:
- Go to the Key-Value Store tab in your Apify run.
- Look for keys like
OUTPUTor page-specific keys.- If using the API, target the Key-Value Store endpoint to download these raw files directly.
Markdown
Returns a single Markdown file per page (or combined), perfect for archiving or direct LLM context.
๐ FAQ
Q: Does it work on sites behind login? A: Currently designed for public documentation and content sites.
Q: How do you count pages? A: Only successfully scraped pages count. If a page fails or is skipped, you aren't charged.
Q: Can I use this with LangChain?
A: Yes! The JSON output is designed to be directly loaded into LangChain's ApifyDatasetLoader.
