URL: https://apify.com/constant_quadruped/site-to-agent-feed

Site to Agent Feed (URL to RAG-ready Markdown) · Apify


๐Ÿ‘ Site to Agent Feed (URL to RAG-ready Markdown) avatar

Site to Agent Feed (URL to RAG-ready Markdown)

Pricing

Pay per usage

Go to Apify Store

Site to Agent Feed (URL to RAG-ready Markdown)

Turn any URL into clean, RAG-ready Markdown + structured JSON for LLMs and AI agents. Self-healing main-content extraction (survives redesigns), headings/links/tables, optional change-detection. No paid APIs.

Pricing: Pay per usage

Rating: 0.0 (0)

Developer: CQ

Maintained by Community

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 15 hours ago

Site to Agent Feed (URL → RAG-ready Markdown)

Give it one or more URLs and get back clean Markdown plus structured JSON built for LLMs and AI agents: main-content extraction (via trafilatura, which adapts to page layout instead of relying on brittle CSS selectors), plus title, headings, links, and a table count. Optional change detection turns it into a site monitor.

Why

Agents and RAG pipelines want Markdown as a first-class return type (not raw HTML), and extraction that doesn't break on every redesign. Pairs well with MCP-based agent stacks.

How it works

  1. Fetches each URL's HTML over HTTP (httpx).
  2. Extracts the main content with trafilatura into Markdown and plain text. Falls back to a BeautifulSoup strip + markdownify if trafilatura returns nothing.
  3. Pulls structure (title, h1–h3 headings, links, table count) with BeautifulSoup.
  4. If detectChanges is on, stores a content hash per URL and sets changed: true when it differs from the previous run.
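
Step 3 can be sketched roughly as follows. This is an illustration only, not the Actor's code: it swaps BeautifulSoup for Python's standard-library html.parser so it runs without dependencies, and the names StructureParser and extract_structure, along with the choice to return headings as plain strings, are assumptions.

```python
from html.parser import HTMLParser

MAX_HEADINGS = 50   # caps quoted in the output description
MAX_LINKS = 200


class StructureParser(HTMLParser):
    """Collects title, h1-h3 headings, links, and a table count."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.links = []
        self.table_count = 0
        self._capture = None   # tag whose text is currently being collected
        self._buffer = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_count += 1
        elif tag in ("title", "h1", "h2", "h3", "a") and self._capture is None:
            # Simplification: a link nested inside a heading is folded
            # into the heading text instead of being recorded as a link.
            self._capture = tag
            self._buffer = []
            self._href = dict(attrs).get("href") if tag == "a" else None

    def handle_data(self, data):
        if self._capture is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "title":
            self.title = text
        elif tag == "a":
            if self._href and len(self.links) < MAX_LINKS:
                self.links.append({"text": text, "href": self._href})
        elif text and len(self.headings) < MAX_HEADINGS:
            self.headings.append(text)
        self._capture = None


def extract_structure(html):
    """Hypothetical helper: return the structural fields for one page."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "headings": parser.headings,
        "links": parser.links,
        "table_count": parser.table_count,
    }
```

A forgiving parser such as BeautifulSoup copes better with malformed markup; the stand-in above is only meant to show which fields are collected and where the caps apply.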

Per-URL output

Each successfully fetched page produces a Dataset item with: url, fetched_at (UTC ISO timestamp), title, markdown, headings[] (h1–h3, capped at 50), links[] ({text, href}, capped at 200), table_count, word_count, content_hash (SHA-256 of the extracted text), and (if detectChanges) changed. The raw text field is included only when outputFormat: "both". text and markdown are truncated to maxChars per page.

If a URL fails to fetch, its item is just { "url": ..., "error": ... }.

outputFormat: "markdown" (default) returns the structured item with markdown (no raw text field); "both" additionally includes the raw extracted text. markdown, headings, links, and all other structured fields are always present in both modes.
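
To make the item shape concrete, here is a minimal sketch of how such an item could be assembled. It is not the Actor's source: build_item is a hypothetical helper, the max_chars default of 20000 is a placeholder, and hashing the untruncated text and counting words by whitespace split are assumptions the description above does not confirm.

```python
import hashlib
from datetime import datetime, timezone


def build_item(url, title, markdown, text, headings, links, table_count,
               output_format="markdown", max_chars=20000):
    """Assemble one Dataset-style item (hypothetical helper)."""
    item = {
        "url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "title": title,
        "markdown": markdown[:max_chars],      # truncated to maxChars
        "headings": headings[:50],             # capped at 50
        "links": links[:200],                  # capped at 200
        "table_count": table_count,
        # Assumption: words are counted by splitting on whitespace.
        "word_count": len(text.split()),
        # Assumption: the hash covers the full, untruncated extracted text.
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    if output_format == "both":
        item["text"] = text[:max_chars]        # raw text only in "both" mode
    return item
```

Consumers should check for an error key first, since a failed fetch carries only url and error.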

Use as a monitor

Schedule it with detectChanges: true; each run flags which pages changed, so an agent only re-ingests what's new.
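
The hash comparison behind changed, and a consumer-side filter, might look like the sketch below. The function names and the in-memory dict store are hypothetical (the Actor's actual storage is not described here), and treating a first-seen URL as changed: false is an assumption.

```python
import hashlib


def detect_change(url, text, previous_hashes):
    """Return (content_hash, changed) and record the hash for the next run."""
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    previous = previous_hashes.get(url)
    # Assumption: a URL with no stored hash is reported as unchanged.
    changed = previous is not None and previous != content_hash
    previous_hashes[url] = content_hash
    return content_hash, changed


def pages_to_reingest(items):
    """Keep only items that fetched successfully and are flagged as changed."""
    return [
        item for item in items
        if "error" not in item and item.get("changed")
    ]
```

Under that first-run assumption, an agent would ingest every page on the initial run and rely on pages_to_reingest only for later scheduled runs.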

Limitations: read this

  • Server-rendered HTML only. No JavaScript execution. It uses a plain HTTP fetch, not a browser. Single-page apps and content injected by JS will be missing or sparse. Use a browser-based scraper for those.
  • Heavily bot-protected sites return 403. Sites behind Akamai/Cloudflare-class bot protection (e.g. SEC.gov, FINRA.org) block non-browser TLS fingerprints and will fail even through a residential proxy. This lightweight fetcher is for normal, server-rendered pages; use a real-browser scraper for protected sites. The optional Apify Proxy (off by default) helps only with simple datacenter-IP blocks, not bot protection.
  • Extraction quality depends on trafilatura. On unusual layouts it may grab too much or too little; the fallback is a coarse text strip.
  • Change-detection is whole-page hashing. Any change (including dynamic timestamps, view counters, or rotating banners) flips changed to true; it does not diff what changed.
  • No anti-bot handling, JS challenges, logins, or pagination. Pages behind Cloudflare/auth or requiring clicks won't work.
  • links and headings are capped (200 / 50) and may be truncated on large pages.
  • Does nothing for compliance beyond sending a basic User-Agent; you are responsible for honoring each site's terms and robots policy.
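
Since robots handling is left to the caller, one option is to pre-filter the URL list before passing it in. The sketch below uses Python's standard urllib.robotparser; allowed_urls is a hypothetical helper, and it assumes all URLs belong to the one site whose robots.txt text you have already downloaded.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(urls, robots_txt, user_agent="*"):
    """Return the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse already-fetched robots.txt
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

This covers robots.txt rules only; a site's terms of service still need a manual read.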
