RAG Web Browser

Pricing

from $1.00 / 1,000 results

Search the web or fetch direct URLs and return clean markdown for LLM/RAG pipelines. Filters: domainAllowlist, domainBlocklist, minTextLength, keywordsAnyOf. No login, no cookies.

Developer

Crawler Bros

Maintained by Community

Actor stats

Last modified: 2 months ago

What this actor does

Sends a query to Google → extracts top organic results → fetches each → cleans HTML → emits structured records
Or fetches direct URLs (startUrls) and skips Google entirely
Strips boilerplate (nav, footer, ads, scripts) before extracting the main article
Outputs markdown (default), plain text, and/or raw html per page
Returns one clean record per page with title, description, language, word count, reading time
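
The boilerplate-stripping step above can be illustrated with a minimal sketch using only the Python standard library. This is not the actor's implementation: the tag list, the class name `BoilerplateStripper`, and the function `strip_boilerplate` are assumptions made for illustration, and a real extractor would also need to locate the main article.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption, not the actor's actual list).
BOILERPLATE_TAGS = {"nav", "footer", "aside", "script", "style"}


class BoilerplateStripper(HTMLParser):
    """Collects text that is not nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a boilerplate element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside boilerplate elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def strip_boilerplate(html):
    """Return the visible text of `html` with boilerplate elements removed."""
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `strip_boilerplate` on a page whose body holds a `<nav>`, an `<article>`, and a `<footer>` would return only the article text, one text node per line.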

Output per page

url, loadedUrl, domain
title, description, languageCode
text (plain text) — when requested
markdown (LLM-ready) — when requested
html (raw cleaned HTML) — when requested
wordCount, readingTimeMinutes (220 wpm)
httpStatusCode, loadedTime (seconds)
searchRank (1-based — when the URL came from Google search)
recordType: "page", scrapedAt

Empty fields are omitted from the output (no nulls).
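
As a rough illustration of the two rules above (reading time at 220 wpm, and omitting empty fields), here is a minimal Python sketch. The helper names `reading_time_minutes` and `omit_empty` are hypothetical, and rounding up to whole minutes is an assumption; the listing does not say how the actor rounds.

```python
import math

WORDS_PER_MINUTE = 220  # reading speed stated in the output field list


def reading_time_minutes(word_count):
    """Estimate reading time; rounding up to whole minutes is an assumption."""
    return math.ceil(word_count / WORDS_PER_MINUTE)


def omit_empty(record):
    """Drop keys whose value is None, an empty string, or an empty list/dict."""
    return {k: v for k, v in record.items() if v not in (None, "", [], {})}


record = omit_empty({
    "url": "https://example.com/post",
    "title": "Example",
    "description": "",      # missing meta description, so the key is dropped
    "languageCode": None,   # undetected language, so the key is dropped
    "wordCount": 660,
    "readingTimeMinutes": reading_time_minutes(660),
    "recordType": "page",
})
```

Consumers should therefore read optional fields with `record.get("description")` rather than indexing directly.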

Input

Field	Type	Default	Description
`query`	string	`"what is retrieval augmented generation"`	Search query OR a single URL
`startUrls`	array	`[]`	Direct URLs to fetch (skips Google search)
`maxResults`	int	`3`	Number of top organic Google results to fetch (1–100)
`outputFormats`	array	`["markdown"]`	Any combination of `markdown`, `text`, `html`
`requestTimeoutSecs`	int	`40`	Per-page HTTP timeout (1–300s)
`scrapingTool`	enum	`raw-http`	`raw-http` (curl_cffi) or `browser-playwright`
`removeElementsCssSelector`	string	nav/footer/aside/script/...	CSS selector(s) to strip before extraction
`htmlTransformer`	enum	`readable-text`	`readable-text` (main article) or `none`
`desiredConcurrency`	int	`5`	Parallel fetches (0 = auto)
`maxRequestRetries`	int	`2`	Retries on transient HTTP failures
`dynamicContentWaitSecs`	int	`5`	Wait time for JS content (browser mode only)
`removeCookieWarnings`	bool	`true`	Dismiss cookie/consent dialogs (browser mode)
`useApifyProxy`	bool	`true`	Route requests through Apify proxy
`domainAllowlist`	array	`[]`	Only emit pages whose host contains one of these substrings
`domainBlocklist`	array	`[]`	Drop pages whose host contains one of these substrings
`minTextLength`	int	–	Drop pages with fewer than N characters of extracted text
`excludeContentSelectors`	array	`[]`	Additional CSS selectors to strip
`keywordsAnyOf`	array	`[]`	Only emit pages containing at least one of these keywords
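
The four filter fields in the table can be read as a single predicate applied to each fetched page. The sketch below is one possible interpretation, not the actor's code: the function name `passes_filters`, case-insensitive matching, matching keywords against the extracted text, and treating an empty list as "filter disabled" are all assumptions.

```python
from urllib.parse import urlparse


def passes_filters(url, text, domain_allowlist=(), domain_blocklist=(),
                   min_text_length=0, keywords_any_of=()):
    """Return True if a page would be emitted under the assumed filter semantics."""
    host = (urlparse(url).hostname or "").lower()

    # domainAllowlist: host must contain at least one listed substring.
    if domain_allowlist and not any(s.lower() in host for s in domain_allowlist):
        return False

    # domainBlocklist: host must not contain any listed substring.
    if any(s.lower() in host for s in domain_blocklist):
        return False

    # minTextLength: drop pages with fewer than N characters of extracted text.
    if len(text) < min_text_length:
        return False

    # keywordsAnyOf: text must contain at least one keyword.
    lowered = text.lower()
    if keywords_any_of and not any(k.lower() in lowered for k in keywords_any_of):
        return False

    return True
```

Because the domain filters are substring matches on the host, an entry such as `"example.com"` also matches `blog.example.com`, and would match an unrelated host like `notexample.com` as well.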

Example: search query

{
  "query": "best vector database for RAG 2024",
  "maxResults": 5,
  "outputFormats": ["markdown"],
  "minTextLength": 500,
  "domainBlocklist": ["pinterest.com", "youtube.com"]
}

Example: direct URLs

{
  "startUrls": [
    "https://en.wikipedia.org/wiki/Retrieval-augmented_generation",
    "https://docs.langchain.com/docs/use-cases/qa/"
  ],
  "outputFormats": ["markdown", "text"],
  "htmlTransformer": "readable-text"
}

Example: filter for relevance

{
  "query": "vector embeddings tutorial",
  "maxResults": 10,
  "keywordsAnyOf": ["embedding", "vector", "similarity"],
  "minTextLength": 1000,
  "outputFormats": ["markdown"]
}

Use cases

RAG ingestion: pull fresh top-N Google results for a topic and hand the markdown to your embedder
News briefings: run a daily query such as "AI news today" and set minTextLength to drop thin SEO pages
Competitive monitoring: set domainAllowlist to competitor domains and scrape their blogs weekly
Reference enrichment: feed each citation URL from a paper into the actor for clean text extraction
LLM context: give an LLM the cleaned markdown of pages instead of raw HTML to save tokens

FAQ

Does it require a login or cookies? No. All fetches are anonymous.

Is a proxy needed? Apify proxy is enabled by default to avoid Google rate-limits and unblock some target sites. You can disable it with useApifyProxy: false.

What's the difference between raw-http and browser-playwright? Raw HTTP uses curl_cffi with chrome131 TLS impersonation — fast and works on ~80% of sites. Browser mode runs headless Chromium, waits for JS, and dismisses cookie banners — slower but handles SPAs.

Why is description missing on some pages? Some pages don't expose a <meta name="description"> or og:description tag. Under the omit-empty contract, missing fields are dropped rather than emitted as nulls.

Why does markdown look stripped down? We intentionally output simple markdown (headings, lists, links, emphasis, code) — RAG embedders strip most formatting anyway, and simpler markdown reduces token bloat.

What if all my filters reject every result? The actor finishes cleanly with a status message instead of pushing placeholder rows.

How do I use this with my LLM/RAG pipeline? Trigger this actor from your indexing job, read the dataset (each record has url and markdown), embed the markdown, and store the vectors in your vector DB.
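
One way to wire the steps above into an indexing job is through the Apify REST API's synchronous run endpoint, sketched here with the Python standard library only. The helper names (`build_request`, `fetch_records`, `to_documents`) are hypothetical, the actor ID is taken from this page's URL, and the 300-second timeout is an assumption; long runs may need the asynchronous run and dataset endpoints instead.

```python
import json
import urllib.request

# In API paths the "/" in "username/actor-name" is written as "~".
ACTOR_ID = "crawlerbros~rag-web-browser"
API_URL = f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items"


def build_request(run_input, token):
    """Build a POST request that runs the actor and returns its dataset items."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def fetch_records(run_input, token, timeout=300):
    """Run the actor and return the parsed list of records (network call)."""
    with urllib.request.urlopen(build_request(run_input, token), timeout=timeout) as resp:
        return json.load(resp)


def to_documents(records):
    """Keep records that carry markdown, as (url, markdown) pairs for an embedder."""
    return [(r["url"], r["markdown"]) for r in records if r.get("markdown")]
```

A typical call would be `to_documents(fetch_records({"query": "vector embeddings tutorial", "maxResults": 5}, token))`, with the resulting pairs passed to whatever embedder and vector store the pipeline already uses. The `r.get("markdown")` check matters because empty fields are omitted from the output.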

URL: https://apify.com/crawlerbros/rag-web-browser
