Site to Agent Feed (URL to RAG-ready Markdown)
Pricing
Pay per usage
Turn any URL into clean, RAG-ready Markdown + structured JSON for LLMs and AI agents. Self-healing main-content extraction (survives redesigns), headings/links/tables, optional change-detection. No paid APIs.
Site to Agent Feed (URL → RAG-ready Markdown)
Give it any URL(s); get back clean Markdown + structured JSON built for LLMs and AI agents: main-content extraction (via trafilatura, which adapts to page layout instead of relying on brittle CSS selectors), plus title, headings, links, and a table count. Optional change-detection turns it into a site monitor.
Why
Agents and RAG pipelines want Markdown as a first-class return type (not raw HTML), and extraction that doesn't break on every redesign. Pairs well with MCP-based agent stacks.
How it works
- Fetches each URL's HTML over HTTP (httpx).
- Extracts the main content with trafilatura, producing Markdown and plain text. Falls back to a BeautifulSoup strip + markdownify if trafilatura returns nothing.
- Pulls structure (title, h1-h3 headings, links, table count) with BeautifulSoup.
- If detectChanges is on, stores a content hash per URL and sets changed: true when it differs from the previous run.
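The structure step can be sketched as follows. This is a minimal illustration rather than the actor's code: it swaps in the standard library's html.parser for BeautifulSoup so it runs without dependencies. The names StructureParser and extract_structure are hypothetical, and storing headings as plain strings is an assumption. The caps of 50 and 200 come from the output description.

```python
from html.parser import HTMLParser

MAX_HEADINGS = 50   # cap stated for headings[]
MAX_LINKS = 200     # cap stated for links[]


class StructureParser(HTMLParser):
    """Collects title, h1-h3 headings, links, and a table count.

    Simplification: a link nested inside a heading is recorded as a link
    only; the enclosing heading is dropped.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []      # assumption: plain strings
        self.links = []         # {text, href} objects
        self.table_count = 0
        self._capture = None    # tag whose text is currently being collected
        self._buffer = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_count += 1
        elif tag in ("title", "h1", "h2", "h3"):
            self._capture, self._buffer = tag, []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._capture, self._buffer, self._href = "a", [], href

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "title":
            self.title = text
        elif tag == "a":
            if len(self.links) < MAX_LINKS:
                self.links.append({"text": text, "href": self._href})
        elif len(self.headings) < MAX_HEADINGS:
            self.headings.append(text)
        self._capture = None


def extract_structure(html: str) -> dict:
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "headings": parser.headings,
        "links": parser.links,
        "table_count": parser.table_count,
    }
```

Calling extract_structure on a fetched page's HTML returns a dict that could be merged into the per-URL item next to the Markdown.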
Per-URL output
Each successfully fetched page produces a Dataset item with:
url, fetched_at (UTC ISO timestamp), title, markdown, headings[] (h1-h3, capped at 50), links[] ({text, href}, capped at 200), table_count, word_count, content_hash (SHA-256 of the extracted text), and (if detectChanges) changed. The raw text field is included only when outputFormat: "both". text and markdown are truncated to maxChars per page.
If a URL fails to fetch, its item is just { "url": ..., "error": ... }.
outputFormat: "markdown" (default) returns the structured item with markdown (no raw text field); "both" additionally includes the raw extracted text. markdown, headings, links, and all other structured fields are always present in both modes.
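To make the item shape concrete, here is a rough sketch of how such an item could be assembled. The function name build_item, the max_chars default of 50000, the whitespace-based word count, and hashing the text before truncation are all assumptions for illustration. Only the field names and the outputFormat behavior come from the description above.

```python
import hashlib
from datetime import datetime, timezone


def build_item(url, title, markdown, text, structure=None,
               output_format="markdown", max_chars=50000):
    """Assemble one per-URL item (hypothetical helper, not the actor's code)."""
    item = {
        "url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "title": title,
        "markdown": markdown[:max_chars],          # truncated to maxChars
        "word_count": len(text.split()),           # assumed counting method
        # Assumption: the hash covers the full extracted text, pre-truncation.
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    # headings, links, table_count are present in both output modes.
    item.update(structure or {"headings": [], "links": [], "table_count": 0})
    if output_format == "both":
        item["text"] = text[:max_chars]            # raw text only in "both"
    return item


def error_item(url, message):
    """Shape of the item emitted when a fetch fails."""
    return {"url": url, "error": message}
```

A consumer can therefore rely on markdown being present for every successful item and should check for the error key before reading any other field.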
Use as a monitor
Schedule it with detectChanges: true; each run flags which pages changed, so an agent only re-ingests what's new.
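On the consuming side, picking out what to re-ingest might look like the sketch below. It assumes the run's Dataset items are already loaded as a list of dicts. The helper name pages_to_reingest is hypothetical, and treating a missing changed key as "ingest" is a design choice of this sketch, not documented behavior.

```python
def pages_to_reingest(items):
    """Return the URLs worth re-ingesting from one run's Dataset items."""
    urls = []
    for item in items:
        if "error" in item:
            continue  # failed fetch: the item carries only url and error
        # Assumption: if `changed` is absent (detectChanges off), ingest anyway.
        if item.get("changed", True):
            urls.append(item["url"])
    return urls


sample = [
    {"url": "https://example.com/a", "markdown": "# A", "changed": True},
    {"url": "https://example.com/b", "markdown": "# B", "changed": False},
    {"url": "https://example.com/c", "error": "403 Forbidden"},
]
print(pages_to_reingest(sample))  # ['https://example.com/a']
```

Skipping error items first avoids a KeyError on fields that failed fetches never carry.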
Limitations (read this)
- Server-rendered HTML only. No JavaScript execution. It uses a plain HTTP fetch, not a browser. Single-page apps and content injected by JS will be missing or sparse. Use a browser-based scraper for those.
- Heavily bot-protected sites return 403. Sites behind Akamai/Cloudflare-class bot protection (e.g. SEC.gov, FINRA.org) block non-browser TLS fingerprints and will fail even through a residential proxy. This lightweight fetcher is for normal, server-rendered pages; use a real-browser scraper for protected sites. The optional Apify Proxy (off by default) helps only with simple datacenter-IP blocks, not with bot protection.
- Extraction quality depends on trafilatura. On unusual layouts it may grab too much or too little; the fallback is a coarse text strip.
- Change-detection is whole-page hashing. Any change (including dynamic timestamps, view counters, or rotating banners) flips changed to true; it does not diff what changed.
- No anti-bot handling, JS challenges, logins, or pagination. Pages behind Cloudflare/auth or requiring clicks won't work.
- links and headings are capped (200 / 50) and may be truncated on large pages.
- Does nothing for compliance beyond sending a basic User-Agent; you are responsible for honoring each site's terms and robots policy.
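The whole-page hashing caveat is easy to reproduce. The sketch below assumes the hash is a SHA-256 over the UTF-8 bytes of the extracted text (the encoding is an assumption) and uses made-up page text; content_hash here is a hypothetical helper named after the output field.

```python
import hashlib


def content_hash(text: str) -> str:
    # Assumption: SHA-256 over the UTF-8 bytes of the extracted text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two extractions of the same page; only the view counter differs.
before = "Quarterly report\nViews: 1041"
after = "Quarterly report\nViews: 1042"

changed = content_hash(before) != content_hash(after)
print(changed)  # True
```

A one-digit counter update is enough to mark the page as changed, so pages with volatile elements will be flagged on nearly every run.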
