HTML Tables to Markdown (GFM) for RAG & LLMs

Pricing

from $1.00 / 1,000 tables extracted

Extract every HTML table from any URL into clean, deterministic GitHub-Flavored Markdown (GFM). Auto-detects headers (or synthesizes col1..N), escapes pipes, collapses whitespace, and stamps each table with an sha256 hash for dedup & idempotency. RAG / embeddings / LLM ready. Same HTML, same output.

Developer

Adam

Maintained by Community

Last modified: 3 days ago

TableForge: Docs -> Queryable GFM Tables

Turn the HTML tables buried in your pages into clean, deterministic, RAG-ready GitHub-Flavored Markdown.

What it does

TableForge fetches each URL you provide, parses the returned HTML with a real DOM (jsdom), and extracts every <table> on the page. Each table is converted into a clean GitHub-Flavored Markdown (GFM) table:

- Cell text is whitespace-collapsed and trimmed, and `|` characters are escaped so the Markdown stays valid.
- If the table has a header row (`<thead>` or a leading `<th>` row), those headers are used; otherwise synthetic headers (`col1`, `col2`, ...) are generated so every table is well-formed.
- A standard GFM header separator (`| --- | --- |`) is emitted, making the output ready to drop into Markdown, paste into an LLM prompt, or feed an embeddings/RAG pipeline.
- Every table gets a deterministic `content_hash` (`sha256:` followed by 64 hex characters) computed over its GFM text, so identical tables always produce identical hashes for dedup and idempotency.

The conversion is fully deterministic: the same HTML in always yields the same Markdown and the same hash out. Nothing is summarized, rewritten, or hallucinated; missing cells are emitted as empty, never invented.
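
The rules above can be sketched in a few lines of Python. This is an illustration of the described behavior, not the Actor's own code (which parses with jsdom). The function names, the UTF-8 encoding of the hash input, and the exact spacing inside each row are assumptions, so hashes produced here will not necessarily match the Actor's.

```python
from __future__ import annotations

import hashlib
import re


def clean_cell(text: str) -> str:
    """Collapse whitespace runs to one space, trim, and escape pipes."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.replace("|", r"\|")


def to_gfm(rows: list[list[str]], headers: list[str] | None = None) -> str:
    """Render body rows (plus optional real headers) as a GFM table."""
    width = max([len(r) for r in rows] + [len(headers or [])])
    if headers is None:
        # No header row found: synthesize col1..colN.
        headers = [f"col{i}" for i in range(1, width + 1)]

    def render(cells: list[str]) -> str:
        # Missing cells are emitted as empty strings, never invented.
        padded = list(cells) + [""] * (width - len(cells))
        return "| " + " | ".join(clean_cell(c) for c in padded) + " |"

    lines = [render(headers), "| " + " | ".join(["---"] * width) + " |"]
    lines += [render(r) for r in rows]
    return "\n".join(lines)


def content_hash(gfm: str) -> str:
    """Hash the GFM text; UTF-8 encoding is an assumption of this sketch."""
    return "sha256:" + hashlib.sha256(gfm.encode("utf-8")).hexdigest()


table = to_gfm([["a  b", "x|y"], ["only"]])
print(table)
print(content_hash(table))
```

Because every step is a pure string transformation, calling `to_gfm` twice on the same rows yields the same text and therefore the same hash.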

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `urls` | array of strings | yes | Page URLs whose tables you are authorized to extract (your own, authorized, or public pages). |
| `ownership_attestation` | boolean | yes | Must be `true` to confirm you own or are authorized to extract from these pages. If `false`/omitted, the run is rejected before any work and bills `$0`. |
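
As a rough illustration of the input contract, the sketch below builds a run input and checks it locally before submission. The `check_input` helper is hypothetical, the URL is a placeholder, and the Actor's real validation may differ in wording and strictness.

```python
def check_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    errors = []
    urls = run_input.get("urls")
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        errors.append("urls must be an array of strings")
    # Anything other than a literal True (False, missing, "yes") is rejected.
    if run_input.get("ownership_attestation") is not True:
        errors.append("ownership_attestation must be true")
    return errors


run_input = {
    "urls": ["https://example.com/docs/pricing"],  # placeholder URL
    "ownership_attestation": True,
}
print(check_input(run_input))
```

Checking locally only saves a round trip; the Actor's own gate is what rejects an unattested run.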

Output

Records are pushed to the dataset. There are two record_type values:

table — one record per extracted table:

| Field | Type | Description |
| --- | --- | --- |
| `record_type` | string | `"table"` |
| `source_url` | string | The page the table came from. |
| `table_index` | integer | Zero-based index of the table within the page. |
| `gfm_table` | string | The full GitHub-Flavored Markdown table. |
| `column_headers` | string[] | The header cells (real or synthesized `col1..colN`). |
| `row_count` | integer | Number of body rows (excluding the header). |
| `content_hash` | string | Deterministic `sha256:<64 hex>` over the GFM text. |

run_summary — one record per run:

| Field | Type | Description |
| --- | --- | --- |
| `record_type` | string | `"run_summary"` |
| `pages_processed` | integer | Pages successfully fetched and scanned. |
| `tables_extracted` | integer | Total tables converted across all pages. |

A page that fails to fetch is skipped entirely (it contributes no records and is not counted). A page that fetches successfully but contains no tables is counted in pages_processed but adds no table records. Either way, pages with zero tables are never billed.

Pricing

TableForge uses Apify Pay-Per-Event. You are billed only for:

| Event | Price (USD) | When it fires |
| --- | --- | --- |
| `actor_run_start` | $0.005 | Once per run, after the ownership and paid-plan gates pass. |
| `table_extracted` | $0.001 | Once per table successfully converted to GFM (the billed unit). |

Pages are scanned but not billed (no double-charging), and failed or empty pages cost $0.

Example run: scan 10 pages that together contain 80 tables -> $0.005 (run start) + 80 x $0.001 = $0.085 total.
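
The same arithmetic can be written as a small estimator. This is a sketch built from the two event prices listed above; the function name is hypothetical, and `Decimal` is used only to avoid floating-point rounding in the total. It covers runs that pass the gates; a rejected run bills $0.

```python
from decimal import Decimal

RUN_START = Decimal("0.005")   # actor_run_start, charged once per run
PER_TABLE = Decimal("0.001")   # table_extracted, charged per converted table


def estimate_cost(tables_extracted: int) -> Decimal:
    """Estimated USD cost of one accepted run; pages themselves are not billed."""
    if tables_extracted < 0:
        raise ValueError("tables_extracted must be non-negative")
    return RUN_START + PER_TABLE * tables_extracted


print(estimate_cost(80))  # the 10-page, 80-table example above
```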

Why this Actor

- Deterministic + idempotent. Output is a pure function of the input HTML. Every table carries an `sha256:` content hash over its Markdown, so you can dedup, cache, and detect changes reliably across runs.
- No hallucination. Tables are parsed structurally from the DOM and reproduced faithfully. Missing cells are emitted empty, never fabricated; nothing is paraphrased or summarized.
- Ownership attestation gate. A run cannot proceed unless you attest you are authorized to extract from the pages; without it, the run is rejected before any work with zero billing.
- Embeddings/RAG-ready by design. Clean GFM with preserved (or synthesized) headers, escaped pipes, and per-table hashes drops straight into LLM prompts, vector stores, and Markdown docs.
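
Change detection across runs could look like the following sketch, which compares the `table` records of two runs. Keying tables by `(source_url, table_index)` is an assumption of this example, not something the Actor prescribes; if a page gains or loses a table, later indexes shift and will show up as changed.

```python
def diff_runs(previous: list[dict], current: list[dict]) -> dict[str, list[tuple]]:
    """Compare table records from two runs by (source_url, table_index)."""
    def index(records: list[dict]) -> dict[tuple, str]:
        return {(r["source_url"], r["table_index"]): r["content_hash"] for r in records}

    prev, curr = index(previous), index(current)
    return {
        "added": sorted(k for k in curr if k not in prev),
        "removed": sorted(k for k in prev if k not in curr),
        # Same position on the same page, but the GFM text hashes differently.
        "changed": sorted(k for k in curr if k in prev and curr[k] != prev[k]),
    }


# Illustrative records with placeholder hashes.
old = [{"source_url": "https://example.com/a", "table_index": 0, "content_hash": "sha256:" + "1" * 64}]
new = [
    {"source_url": "https://example.com/a", "table_index": 0, "content_hash": "sha256:" + "2" * 64},
    {"source_url": "https://example.com/a", "table_index": 1, "content_hash": "sha256:" + "3" * 64},
]
print(diff_runs(old, new))
```

Only tables listed under `added` or `changed` need to be re-embedded; unchanged hashes can be served from cache.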

This Actor is AI-authored and operated under the publisher's LLC. It uses Apify's Actor.charge() solely to bill the customer for the events above; the Actor contains no payout or money-out capability of any kind.

URL: https://apify.com/awesome_highboy/tableforge
