
URL: https://apify.com/macheta/justhtml-link-to-markdown

URL to Markdown (JustHTML) - Clean Markdown Extractor · Apify



Pricing

Pay per usage


Convert webpages to clean Markdown for RAG and archiving. Uses JustHTML and supports optional Cloudflare/Turnstile bypass plus CSS selector extraction.

Rating

5.0

(1)

Developer

Anass

Maintained by Community

Actor stats

1

Bookmarked

46

Total users

13

Monthly active users

5 months ago

Last modified

Link to Markdown (JustHTML + Cloudflare Bypass)

🔗 URL → 🧼 Clean Markdown • 🛡️ Optional bypass • 🎯 CSS selector

Convert web links into clean Markdown for RAG, archiving, content pipelines, and AI agents.

This Actor fetches a URL, optionally bypasses Cloudflare challenges using the same Camoufox-based open-source bypass approach used in this repository, and converts the resulting HTML to Markdown using JustHTML (a pure-Python HTML5 parser with built-in safe output).

Keywords

link to markdown, html to markdown, webpage to markdown, url to markdown, cloudflare bypass, turnstile, anti-bot, RAG, LLM, AI agent, markdown extractor

Why this Actor (SEO)

If you need a dependable URL → Markdown converter for RAG pipelines, you usually hit three problems:

  1. Broken or messy HTML that produces garbage Markdown
  2. Heavy JavaScript pages that hide the real content
  3. Anti-bot / Cloudflare interstitials that block simple fetchers

This Actor is built to be a practical extractor for AI agents, vector databases, knowledge bases, and content archiving workflows.

Common use cases

  • Convert product docs pages into Markdown for RAG
  • Build internal knowledge base snapshots from URLs
  • Extract "article" content with a CSS selector (main, article, .content)
  • Prepare clean Markdown for embedding/search indexing

Tips for better extraction

  • Set selector to target the content container (article, main, .markdown-body)
  • Use includeHtml=true only when debugging extraction
  • Keep safe=true when ingesting untrusted pages into downstream systems

What you get

  • Markdown output per URL (optionally for a specific CSS selector like article, main, or .markdown-body)
  • Safe-by-default sanitization for untrusted HTML
  • Optional Cloudflare challenge bypass fallback when normal fetching fails
  • Dataset output suitable for exporting to JSON/CSV

Input

  • urls (array) or url (string)
  • selector (string, optional)
  • safe (boolean, default: true)
  • useCloudflareBypass (boolean, default: true)
  • bypassCache (boolean, default: false)
  • proxyUrl (string, optional)
  • includeHtml (boolean, default: false)
  • maxConcurrency (int, default: 2)
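
As a rough illustration of how these fields fit together, the Python sketch below assembles a run input using the defaults listed above. The helper name build_input and its validation rules are assumptions made for this example; they are not part of the Actor, which may validate its input differently.

```python
from __future__ import annotations

from typing import Any

# Defaults copied from the input list above.
DEFAULTS: dict[str, Any] = {
    "safe": True,
    "useCloudflareBypass": True,
    "bypassCache": False,
    "includeHtml": False,
    "maxConcurrency": 2,
}

# Optional string fields that have no documented default.
OPTIONAL_STRINGS = ("selector", "proxyUrl")


def build_input(urls: str | list[str], **overrides: Any) -> dict[str, Any]:
    """Hypothetical helper: merge the documented defaults with overrides."""
    if isinstance(urls, str):
        urls = [urls]  # accept a single URL, mirroring the url/urls pair
    if not urls:
        raise ValueError("at least one URL is required")
    unknown = set(overrides) - set(DEFAULTS) - set(OPTIONAL_STRINGS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    run_input: dict[str, Any] = {"urls": list(urls), **DEFAULTS}
    run_input.update(overrides)
    return run_input
```

For example, build_input("https://example.com", selector="article") yields an input that keeps safe=true and the other defaults while targeting only the article element.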

Output (dataset items)

Each item contains:

  • url, finalUrl
  • status (success or failed)
  • title
  • markdown
  • statusCode, contentType
  • bypassed (boolean)
  • error (string, if failed)
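
One way to consume these items downstream, sketched in Python under the assumption that each item is a plain dictionary with the fields listed above. The function name split_results and the sample items are invented for illustration and do not come from a real run.

```python
from __future__ import annotations

from typing import Any, Iterable


def split_results(
    items: Iterable[dict[str, Any]],
) -> tuple[dict[str, str], list[tuple[str, str]]]:
    """Separate successful Markdown documents from failed URLs."""
    documents: dict[str, str] = {}
    failures: list[tuple[str, str]] = []
    for item in items:
        if item.get("status") == "success" and item.get("markdown"):
            # Prefer the post-redirect URL as the document key.
            key = item.get("finalUrl") or item["url"]
            documents[key] = item["markdown"]
        else:
            failures.append(
                (item.get("url", ""), item.get("error", "unknown error"))
            )
    return documents, failures


# Invented sample items shaped like the field list above.
SAMPLE_ITEMS = [
    {
        "url": "https://example.com/docs",
        "finalUrl": "https://example.com/docs/",
        "status": "success",
        "title": "Docs",
        "markdown": "# Docs\n",
        "statusCode": 200,
        "contentType": "text/html",
        "bypassed": False,
    },
    {
        "url": "https://example.com/blocked",
        "status": "failed",
        "error": "HTTP 403",
    },
]

documents, failures = split_results(SAMPLE_ITEMS)
```

Keying documents by finalUrl avoids storing the same page twice when several input URLs redirect to one location.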

Example input

{
  "urls": [
    "https://github.com/EmilStenstrom/justhtml"
  ],
  "selector": ".markdown-body",
  "safe": true,
  "useCloudflareBypass": true
}
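
A minimal sketch of sending this input from Python with the apify-client package (pip install apify-client). The Actor ID is taken from the store URL of this page; reading the token from an APIFY_TOKEN environment variable and the function name fetch_markdown are assumptions for this example.

```python
from __future__ import annotations

import os
from typing import Any

# Actor ID inferred from the store URL; adjust if the Actor is renamed.
ACTOR_ID = "macheta/justhtml-link-to-markdown"

# Same values as the example input above.
RUN_INPUT: dict[str, Any] = {
    "urls": ["https://github.com/EmilStenstrom/justhtml"],
    "selector": ".markdown-body",
    "safe": True,
    "useCloudflareBypass": True,
}


def fetch_markdown(run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Run the Actor and return its dataset items (needs network access)."""
    # Imported here so the module loads even without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    # call() starts the run and waits for it to finish.
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    for item in fetch_markdown(RUN_INPUT):
        print(item.get("status"), item.get("url"))
```

Running the script requires a valid Apify API token and is billed per usage, as noted in the pricing above.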

Deploy to Apify

  1. Install Apify CLI and log in
  2. From this Actor directory, run:
$ apify push

Then publish from the Apify Console with a title/description similar to this README for strong discoverability:

  • Keywords: link to markdown, html to markdown, justhtml, cloudflare bypass, turnstile, RAG

Licensing

  • This Actor's code in this repository follows the repository's license.
  • JustHTML is vendored and distributed under its own license (see its LICENSE file).

You might also like

Website To Markdown

smart_api/website-to-markdown

Convert any webpage into clean, LLM-ready Markdown in seconds, perfect for AI training data, RAG pipelines, and content archiving.

Markdown Anything - URL to Markdown

s-r/markdown-anything

Convert any URL to clean markdown using a 3-provider fallback chain. Batch input, high concurrency.

Website To Markdown

swarmgarden/website-to-markdown

Convert any webpage to clean, readable Markdown format. Perfect for content extraction and readability.

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Convert To Markdown

datavault/convert-to-markdown

Convert to Markdown converts documents, spreadsheets, images (OCR), audio (transcription), and web/data files into clean Markdown. It runs fully locally, requires no API keys, and is ideal for LLMs, docs, and archiving.

Web-to-Markdown Generator for AI & RAG Pipelines

profitstack/web-to-markdown-generator-for-ai-rag-pipelines

Convert any website into clean, LLM-ready Markdown with heading-based chunking for RAG and AI agents.