Pricing
from $3.00 / 1,000 pages converted to markdown
Web to Markdown for LLMs
Convert any URL to clean LLM-ready markdown. 60-70% fewer tokens than raw HTML. Built for AI agents and RAG pipelines.
Convert any URL to clean, structured markdown optimized for LLM consumption. 85% average token savings vs raw HTML. The open-source Firecrawl alternative on Apify.
Why This Actor?
LLMs choke on raw HTML. Scripts, styles, navigation, ads: all noise that burns tokens and confuses models. This actor strips all that away and returns clean markdown that your AI can actually reason about.
    Raw HTML: 67,841 tokens → costs $0.068 per page (GPT-4)
    Markdown:  6,176 tokens → costs $0.006 per page (GPT-4)
    → 91% savings
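The savings figure in the example above follows directly from the two token counts. A minimal Python check, where `savings_percent` is a hypothetical helper name used for illustration and not part of the actor:

```python
def savings_percent(html_tokens: int, markdown_tokens: int) -> float:
    """Percentage of tokens removed by converting HTML to markdown."""
    if html_tokens <= 0:
        raise ValueError("html_tokens must be positive")
    return (1 - markdown_tokens / html_tokens) * 100


# Token counts taken from the example above.
print(round(savings_percent(67_841, 6_176)))  # 91
```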
How It Works
Any URL → Puppeteer → Clean Markdown

- Puppeteer renders the page (JavaScript, SPAs, dynamic content).
- Cheerio parses the rendered HTML, and Turndown converts it to markdown.
- Output: clean markdown + metadata + stats.

Noise removed: scripts, styles, nav, footer, ads, popups, modals

Kept: headings, paragraphs, lists, tables, links, images, code blocks
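To make the "remove noise, keep content" step concrete, here is a toy sketch in Python. It is not the actor's implementation: the actor uses Puppeteer, Cheerio, and Turndown (JavaScript), while this stand-in uses only the standard-library `html.parser`, does no JavaScript rendering, and keeps only headings and paragraphs. The `NOISE_TAGS` set and all names below are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed noise list, loosely based on the "Noise removed" items above.
NOISE_TAGS = {"script", "style", "nav", "footer"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}


class MarkdownSketch(HTMLParser):
    """Toy converter: drops noise subtrees, keeps headings and paragraphs."""

    def __init__(self) -> None:
        super().__init__()
        self.noise_depth = 0  # > 0 while inside a noise subtree
        self.blocks = []      # finished markdown blocks
        self.current = None   # text pieces of the block being built

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif self.noise_depth == 0 and (tag in HEADING_PREFIX or tag == "p"):
            self.current = [HEADING_PREFIX.get(tag, "")]

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
        elif self.current is not None and (tag in HEADING_PREFIX or tag == "p"):
            self.blocks.append("".join(self.current).strip())
            self.current = None

    def handle_data(self, data):
        # Text is kept only outside noise and inside a heading or paragraph.
        if self.noise_depth == 0 and self.current is not None:
            self.current.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<h1>Title</h1><p>Hello <b>world</b>.</p>"
    "<script>track()</script><footer>Example Inc.</footer>"
    "</body></html>"
)
print(html_to_markdown(sample))
```

The sketch drops inline formatting such as bold text, as well as lists, tables, links, images, and code blocks, all of which the actor is described as keeping.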
What Data Does It Extract?
| Field | Description |
|---|---|
| `markdown` | Clean, structured markdown content |
| `title` | Page title |
| `description` | Meta description |
| `author` | Article author (when available) |
| `publishDate` | Publication date |
| `language` | Page language |
| `wordCount` | Total words in markdown |
| `links` | All links found (text + href) |
| `images` | All images (src + alt text) |
| `tableOfContents` | Heading structure for navigation |
| `stats.htmlTokensEstimate` | Original HTML token count |
| `stats.markdownTokensEstimate` | Markdown token count |
| `stats.tokenSavingsPercent` | Percentage of tokens saved |
| `stats.renderTimeMs` | Page render time |
Use Cases
- RAG Pipelines: Feed clean web content into vector databases (Pinecone, Weaviate, Chroma). 85% fewer tokens = 85% lower embedding costs.
- AI Agent Tool Use: Give your agent a "read the web" tool. Pass any URL, get structured content back. Works with LangChain, LlamaIndex, CrewAI, AutoGen.
- Content Repurposing: Convert any article/blog into markdown for your CMS, newsletter, or documentation site.
- Training Data: Build LLM training datasets from web content. Clean markdown = higher quality training data.
- Competitive Intelligence: Monitor competitor websites and extract structured content for analysis.
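For the RAG use case, the markdown output usually needs to be split into chunks before embedding. One possible approach, sketched below, is to split on headings so each chunk stays on one topic. The chunking strategy and the `chunk_by_heading` name are assumptions for illustration; the actor itself returns the full markdown and does not chunk it.

```python
import re

# Matches ATX headings such as "# Title" or "### Section".
# Limitation: a "#" line inside a fenced code block would also match.
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split markdown into one chunk per heading section."""
    chunks: list[dict] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            chunks.append({"heading": match.group(1).strip(), "lines": []})
        elif chunks:
            chunks[-1]["lines"].append(line)
        elif line.strip():
            # Text before the first heading gets its own untitled chunk.
            chunks.append({"heading": None, "lines": [line]})
    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


doc = (
    "# How AI Agents Read the Web\n\nIntro text.\n\n"
    "## The Problem with Raw HTML\n\nToo many tokens.\n"
)
for chunk in chunk_by_heading(doc):
    print(chunk["heading"], "->", chunk["text"])
```

Each resulting `text` can then be embedded and stored alongside its `heading` as metadata in the vector database of your choice.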
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes* | - | Single URL to convert |
| `urls` | string[] | Yes* | - | Array of URLs for batch processing |
| `includeLinks` | boolean | No | true | Include extracted links in output |
| `includeImages` | boolean | No | true | Include image URLs in output |
| `includeToc` | boolean | No | false | Include table of contents |
| `waitFor` | number | No | 3000 | Wait time (ms) for JS rendering |

*Provide either `url` or `urls`.
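The parameters above can be assembled into an input object before starting a run. The sketch below builds such a dict with the documented defaults and checks the "either `url` or `urls`" rule on the client side. The `build_input` helper is hypothetical, and treating "both supplied" as an error is an assumption; the page does not say how the actor handles that case.

```python
def build_input(url=None, urls=None, include_links=True,
                include_images=True, include_toc=False, wait_for=3000) -> dict:
    """Build an actor input dict using the documented parameter names."""
    # Assumption: exactly one of url / urls should be supplied.
    if (url is None) == (urls is None):
        raise ValueError("Provide exactly one of url or urls")
    payload = {
        "includeLinks": include_links,
        "includeImages": include_images,
        "includeToc": include_toc,
        "waitFor": wait_for,  # milliseconds
    }
    if url is not None:
        payload["url"] = url
    else:
        payload["urls"] = list(urls)
    return payload


single = build_input(url="https://example.com")
batch = build_input(urls=["https://example.com/a", "https://example.com/b"],
                    include_toc=True)
print(single)
print(batch)
```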
Output Example
    {
      "url": "https://blog.example.com/article",
      "sourceUrl": "https://blog.example.com/article",
      "title": "How AI Agents Read the Web",
      "description": "A guide to building web-reading capabilities for AI agents",
      "author": "Jane Doe",
      "publishDate": "2026-03-25T10:00:00.000Z",
      "language": "en",
      "markdown": "# How AI Agents Read the Web\n\n**Author:** Jane Doe\n**Published:** 2026-03-25\n\n---\n\nAI agents need structured data to reason about web content...",
      "wordCount": 2450,
      "links": [
        {"text": "LangChain docs", "href": "https://docs.langchain.com"},
        {"text": "Vector databases", "href": "https://www.pinecone.io"}
      ],
      "images": [
        {"src": "https://blog.example.com/diagram.png", "alt": "Architecture diagram"}
      ],
      "tableOfContents": [
        {"level": 1, "text": "How AI Agents Read the Web"},
        {"level": 2, "text": "The Problem with Raw HTML"},
        {"level": 2, "text": "The Markdown Solution"}
      ],
      "stats": {
        "htmlSize": 245000,
        "markdownSize": 12400,
        "htmlTokensEstimate": 61250,
        "markdownTokensEstimate": 3100,
        "tokenSavingsPercent": 95,
        "renderTimeMs": 4200
      }
    }
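The `stats` values in this example are consistent with the ~4 chars/token heuristic described in the FAQ. The sketch below reproduces them, assuming that `htmlSize` and `markdownSize` are character counts and that the estimate uses integer division; neither detail is confirmed by the page, and real tokenizers will give different counts.

```python
CHARS_PER_TOKEN = 4  # heuristic from the FAQ; actual counts vary by model


def estimate_tokens(size_in_chars: int) -> int:
    """Rough token estimate from a character count."""
    return size_in_chars // CHARS_PER_TOKEN


html_tokens = estimate_tokens(245_000)     # htmlSize from the example
markdown_tokens = estimate_tokens(12_400)  # markdownSize from the example
savings = round((1 - markdown_tokens / html_tokens) * 100)

print(html_tokens, markdown_tokens, savings)  # 61250 3100 95
```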
Performance Benchmarks
Tested across 60 diverse websites:
| Site Type | Success Rate | Avg Token Savings | Avg Time |
|---|---|---|---|
| News (BBC, CNN, NYT) | 100% | 94% | 16s |
| Blogs/Articles | 100% | 91% | 8s |
| Documentation | 100% | 92% | 5s |
| Company websites | 100% | 100% | 12s |
| Wikipedia | 100% | 73% | 7s |
| E-commerce | 80% | 90% | 10s |
| Heavy SPAs | 60% | 54% | 6s |
| Overall | 80% | 85% | 10s |
Comparison vs Firecrawl
| Feature | This Actor | Firecrawl |
|---|---|---|
| Token savings | 85% avg | 67% avg |
| Price | $0.003/page | $0.0008-0.005/page |
| JS rendering | Puppeteer (full) | Playwright |
| Free tier | Apify free plan | 500 credits |
| Open source | Yes (Apify) | Partial |
| Batch processing | Yes (urls array) | Yes |
| Standby API | Yes (instant) | Yes |
Standby API (Instant Response)
This actor supports Apify Standby mode for instant HTTP responses:
    # Health check
    curl "https://george-the-developer--web-to-markdown-llm.apify.actor/" \
      -H "Authorization: Bearer YOUR_TOKEN"

    # Convert a URL
    curl "https://george-the-developer--web-to-markdown-llm.apify.actor/convert?url=https://example.com" \
      -H "Authorization: Bearer YOUR_TOKEN"
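The same `/convert` call can be made from Python. This is a minimal sketch using only the standard library; it assumes the endpoint returns JSON (as the Integrations section suggests) and percent-encodes the target URL, which the curl example above does not do. The function names are illustrative, and `YOUR_TOKEN` is a placeholder for your Apify API token.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://george-the-developer--web-to-markdown-llm.apify.actor"


def build_convert_request(target_url: str, token: str) -> Request:
    """Build the GET /convert request; the target URL is percent-encoded."""
    query = urlencode({"url": target_url})
    return Request(
        f"{BASE_URL}/convert?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )


def convert(target_url: str, token: str, timeout: float = 60.0) -> dict:
    """Send the request and parse the JSON body (performs network I/O)."""
    request = build_convert_request(target_url, token)
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not touch the network.
request = build_convert_request("https://example.com/a?b=1", "YOUR_TOKEN")
print(request.full_url)
```

Calling `convert("https://example.com", token)` should then return a dict shaped like the output example, with the converted text under the `markdown` key, provided the assumptions above hold.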
Pricing
Pay Per Event: $0.003 per page converted
| Volume | Cost | Savings vs Firecrawl |
|---|---|---|
| 100 pages | $0.30 | - |
| 1,000 pages | $3.00 | - |
| 10,000 pages | $30.00 | - |
No monthly subscription. Pay only for what you use.
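The table rows are a straight multiplication of page count by the per-page price. A small sketch for estimating other volumes, using `Decimal` to avoid floating-point rounding in currency values (`conversion_cost` is a hypothetical helper, and the figure covers only the per-page event price quoted above):

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.003")  # pay-per-event price quoted above, in USD


def conversion_cost(pages: int) -> Decimal:
    """Total cost in USD for converting the given number of pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE * pages


for pages in (100, 1_000, 10_000):
    print(f"{pages:>6} pages: ${conversion_cost(pages):.2f}")
```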
Integrations
Works with any tool that can call an HTTP API:
- LangChain: Use as a custom tool in your agent chain
- LlamaIndex: Feed markdown into document loaders
- n8n / Make: HTTP request node → markdown output
- Python: `requests.get()` → JSON with markdown
- Node.js: `fetch()` → structured response
FAQ
Q: Does it handle JavaScript-rendered pages?
A: Yes. It uses Puppeteer with full Chrome to render JavaScript, SPAs, and dynamic content.
Q: What about pages behind logins?
A: It currently extracts public content only. Authenticated scraping is on the roadmap.
Q: How accurate is the token estimate?
A: It uses the ~4 chars/token heuristic for English text. Actual token counts may vary by model.
Q: Can I process multiple URLs at once?
A: Yes. Pass a `urls` array to convert multiple pages in one run.
Support
- GitHub: the-ai-entrepreneur-ai-hub
- Apify Store: george.the.developer
- Twitter: @ai_in_it
Found a bug? Open an issue or DM on Twitter.
