URL: https://apify.com/crawlerbros/rag-web-browser

RAG Web Browser · Apify


Pricing

from $1.00 / 1,000 results

Search the web or fetch direct URLs and return clean markdown for LLM/RAG pipelines. Filters: domainAllowlist, domainBlocklist, minTextLength, keywordsAnyOf. No login, no cookies.

Rating: 0.0 (0)

Developer: Crawler Bros (Maintained by Community)

Actor stats: 0 bookmarked, 8 total users, 3 monthly active users, last modified 2 months ago

Search the web (or fetch direct URLs) and return clean markdown ready for LLM/RAG pipelines. HTTP-first with chrome131 TLS impersonation; Playwright fallback when needed. Pro filters narrow the result set to exactly what your retrieval index needs.

What this actor does

  • Sends a query to Google → extracts top organic results → fetches each → cleans HTML → emits structured records
  • Or fetches direct URLs (startUrls) and skips Google entirely
  • Strips boilerplate (nav, footer, ads, scripts) before extracting the main article
  • Outputs markdown (default), plain text, and/or raw html per page
  • Returns one clean record per page with title, description, language, word count, reading time
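
The boilerplate-stripping step above can be illustrated with a minimal sketch. It uses only the Python standard library and drops text inside the four tags named in the default selector (nav, footer, aside, script). The class and function names are hypothetical, and the actor's real extraction logic is not documented here, so treat this as an approximation of the idea only.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (assumption: the actor's
# actual default list is longer; only these four are named in the docs).
BOILERPLATE_TAGS = {"nav", "footer", "aside", "script"}


class BoilerplateStripper(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of `html` with boilerplate elements removed."""
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A production extractor would also score content blocks to find the main article; this sketch only shows the tag-removal part.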

Output per page

  • url, loadedUrl, domain
  • title, description, languageCode
  • text (plain text), when requested
  • markdown (LLM-ready), when requested
  • html (raw cleaned HTML), when requested
  • wordCount, readingTimeMinutes (220 wpm)
  • httpStatusCode, loadedTime (seconds)
  • searchRank (1-based, present when the URL came from Google search)
  • recordType: "page", scrapedAt

Empty fields are omitted from the output (no nulls).
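
As a rough illustration of the record shape and the omit-empty rule, the sketch below builds a record from extracted text, derives wordCount and readingTimeMinutes at 220 wpm, and drops empty fields. The function name, the whitespace-based word split, and rounding up to whole minutes are assumptions; the actor's exact counting and rounding rules are not stated.

```python
import math
from datetime import datetime, timezone

WORDS_PER_MINUTE = 220  # reading speed stated in the field list


def build_record(url, title, description, text):
    """Assemble one page record and omit empty fields (no nulls)."""
    words = len(text.split())  # assumption: words are whitespace-separated
    record = {
        "url": url,
        "title": title,
        "description": description,
        "text": text,
        "wordCount": words,
        # Assumption: round up to whole minutes.
        "readingTimeMinutes": math.ceil(words / WORDS_PER_MINUTE),
        "recordType": "page",
        "scrapedAt": datetime.now(timezone.utc).isoformat(),
    }
    # Drop None, empty strings, and empty collections; keep 0 and False.
    return {k: v for k, v in record.items() if v not in (None, "", [], {})}
```

Consumers should therefore read optional fields with a default, for example record.get("description", ""), rather than indexing them directly.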

Input

Field | Type | Default | Description
query | string | "what is retrieval augmented generation" | Search query OR a single URL
startUrls | array | [] | Direct URLs to fetch (skips Google search)
maxResults | int | 3 | Number of top organic Google results to fetch (1–100)
outputFormats | array | ["markdown"] | Any combination of markdown, text, html
requestTimeoutSecs | int | 40 | Per-page HTTP timeout (1–300s)
scrapingTool | enum | raw-http | raw-http (curl_cffi) or browser-playwright
removeElementsCssSelector | string | nav/footer/aside/script/... | CSS selector(s) to strip before extraction
htmlTransformer | enum | readable-text | readable-text (main article) or none
desiredConcurrency | int | 5 | Parallel fetches (0 = auto)
maxRequestRetries | int | 2 | Retries on transient HTTP failures
dynamicContentWaitSecs | int | 5 | Wait time for JS content (browser mode only)
removeCookieWarnings | bool | true | Dismiss cookie/consent dialogs (browser mode)
useApifyProxy | bool | true | Route requests through Apify proxy
domainAllowlist | array | [] | Only emit pages whose host contains one of these substrings
domainBlocklist | array | [] | Drop pages whose host contains one of these substrings
minTextLength | int | – | Drop pages with fewer than N characters of extracted text
excludeContentSelectors | array | [] | Additional CSS selectors to strip
keywordsAnyOf | array | [] | Only emit pages containing at least one of these keywords
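
To make the filter semantics concrete, here is a small sketch of how the four filter fields could be applied to a record. It follows the descriptions in the table (host substring match, character count, any-of keywords). The function name is hypothetical, and case-insensitive matching is an assumption, since the table does not say how case is handled.

```python
from urllib.parse import urlparse


def passes_filters(record, domain_allowlist=(), domain_blocklist=(),
                   min_text_length=None, keywords_any_of=()):
    """Return True if `record` survives all configured filters."""
    host = (urlparse(record["url"]).hostname or "").lower()
    text = record.get("text", "")
    lowered = text.lower()  # assumption: matching ignores case

    # domainAllowlist: host must contain at least one listed substring.
    if domain_allowlist and not any(s.lower() in host for s in domain_allowlist):
        return False
    # domainBlocklist: host must contain none of the listed substrings.
    if any(s.lower() in host for s in domain_blocklist):
        return False
    # minTextLength: counts characters of extracted text, not words.
    if min_text_length is not None and len(text) < min_text_length:
        return False
    # keywordsAnyOf: at least one keyword must appear in the text.
    if keywords_any_of and not any(k.lower() in lowered for k in keywords_any_of):
        return False
    return True
```

Because the domain filters match substrings of the host, an entry such as "youtube.com" also matches "m.youtube.com".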

Example: search query

{
  "query": "best vector database for RAG 2024",
  "maxResults": 5,
  "outputFormats": ["markdown"],
  "minTextLength": 500,
  "domainBlocklist": ["pinterest.com", "youtube.com"]
}

Example: direct URLs

{
  "startUrls": [
    "https://en.wikipedia.org/wiki/Retrieval-augmented_generation",
    "https://docs.langchain.com/docs/use-cases/qa/"
  ],
  "outputFormats": ["markdown", "text"],
  "htmlTransformer": "readable-text"
}

Example: filter for relevance

{
  "query": "vector embeddings tutorial",
  "maxResults": 10,
  "keywordsAnyOf": ["embedding", "vector", "similarity"],
  "minTextLength": 1000,
  "outputFormats": ["markdown"]
}

Use cases

  • RAG ingestion: pull fresh top-N Google results for a topic and hand the markdown to your embedder
  • News briefings: run a daily query like "AI news today" and filter by minTextLength to drop thin SEO pages
  • Competitive monitoring: set domainAllowlist to competitor domains and scrape their blogs weekly
  • Reference enrichment: feed each citation URL from a paper into the actor for clean text extraction
  • LLM context: give an LLM the cleaned markdown of pages, not raw HTML, to save tokens

FAQ

Does it require a login or cookies? No. All fetches are anonymous.

Is a proxy needed? Apify proxy is enabled by default to avoid Google rate-limits and unblock some target sites. You can disable it with useApifyProxy: false.

What's the difference between raw-http and browser-playwright? Raw HTTP uses curl_cffi with chrome131 TLS impersonation: fast, and works on ~80% of sites. Browser mode runs headless Chromium, waits for JS, and dismisses cookie banners: slower, but handles SPAs.
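
The HTTP-first pattern with a browser fallback might look roughly like the sketch below. The fallback trigger (a failed request or a very short body) and the min_chars threshold are assumptions, since the actor's actual criteria are not documented. The two fetchers use curl_cffi and Playwright, imported lazily, and only approximate what each mode does.

```python
def fetch_with_fallback(url, http_fetch, browser_fetch, min_chars=200):
    """Try the fast HTTP fetcher first; fall back to the browser fetcher.

    Returns (html, tool_name). Falling back on errors or on bodies shorter
    than `min_chars` is a hypothetical heuristic, not the actor's rule.
    """
    try:
        html = http_fetch(url)
    except Exception:
        html = ""
    if len(html) >= min_chars:
        return html, "raw-http"
    return browser_fetch(url), "browser-playwright"


def curl_cffi_fetch(url, timeout=40):
    """Plain HTTP fetch with Chrome 131 TLS impersonation (needs curl_cffi)."""
    from curl_cffi import requests
    resp = requests.get(url, impersonate="chrome131", timeout=timeout)
    resp.raise_for_status()
    return resp.text


def playwright_fetch(url, wait_secs=5):
    """Render the page in headless Chromium (needs playwright + browsers)."""
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_timeout(wait_secs * 1000)  # let JS-rendered content settle
        html = page.content()
        browser.close()
        return html
```

Usage would be fetch_with_fallback(url, curl_cffi_fetch, playwright_fetch). Passing the fetchers in as arguments keeps the decision logic testable without network access.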

Why is description missing on some pages? Some pages don't expose a <meta name="description"> or og:description. The omit-empty contract drops missing fields rather than emitting nulls.

Why does markdown look stripped down? We intentionally output simple markdown (headings, lists, links, emphasis, code). RAG embedders strip most formatting anyway, and simpler markdown reduces token bloat.

What if all my filters reject every result? The actor finishes cleanly with a status message instead of pushing placeholder rows.

How do I use this with my LLM/RAG pipeline? Trigger this actor from your indexing job, read the dataset (each record has url + markdown), embed the markdown, and store the vectors in your vector DB.
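
A minimal sketch of that flow, assuming the official apify-client Python package and an actor ID of "crawlerbros/rag-web-browser" (inferred from the page URL, not confirmed). The helper names and the document shape passed to the embedder are hypothetical; the embedding and vector-store steps are left to your own stack.

```python
def fetch_records(token, run_input, actor_id="crawlerbros/rag-web-browser"):
    """Run the actor and return its dataset items (needs apify-client)."""
    from apify_client import ApifyClient
    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


def to_documents(records):
    """Map actor records to embedder-ready documents.

    Records without markdown are skipped, because empty fields are omitted
    from the output rather than emitted as nulls.
    """
    docs = []
    for rec in records:
        markdown = rec.get("markdown")
        if not markdown:
            continue
        docs.append({
            "id": rec["url"],
            "text": markdown,
            "metadata": {
                "title": rec.get("title", ""),
                "domain": rec.get("domain", ""),
            },
        })
    return docs
```

An indexing job would call fetch_records(token, {"query": "...", "outputFormats": ["markdown"]}), pass the result through to_documents, and then embed each document's text.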
