URL: https://apify.com/awesome_highboy/tableforge

HTML Table to Markdown Extractor (GFM) for RAG & LLMs · Apify


πŸ‘ HTML Tables to Markdown (GFM) for RAG & LLMs avatar

HTML Tables to Markdown (GFM) for RAG & LLMs

Pricing

from $1.00 / 1,000 tables extracted

Extract every HTML table from any URL into clean, deterministic GitHub-Flavored Markdown (GFM). Auto-detects headers (or synthesizes col1..N), escapes pipes, collapses whitespace, and stamps each table with an sha256 hash for dedup & idempotency. RAG / embeddings / LLM ready. Same HTML, same output.

Developer

πŸ‘ Adam

Adam

Maintained by Community

TableForge: Docs -> Queryable GFM Tables

Turn the HTML tables buried in your pages into clean, deterministic, RAG-ready GitHub-Flavored Markdown.

What it does

TableForge fetches each URL you provide, parses the returned HTML with a real DOM (jsdom), and extracts every <table> on the page. Each table is converted into a clean GitHub-Flavored Markdown (GFM) table:

  • Cell text is whitespace-collapsed and trimmed, and | characters are escaped so the Markdown stays valid.
  • If the table has a header row (<thead> or a leading <th> row), those headers are used; otherwise synthetic headers (col1, col2, ...) are generated so every table is well-formed.
  • A standard GFM header separator (| --- | --- |) is emitted, making the output ready to drop into Markdown, paste into an LLM prompt, or feed an embeddings/RAG pipeline.
  • Every table gets a deterministic content_hash (sha256: + 64 hex) computed over its GFM text, so identical tables always produce identical hashes for dedup and idempotency.
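The conversion rules listed above can be sketched in a few lines of Python. This is only an illustration, not the Actor's implementation (which the description says uses jsdom): the names `clean_cell` and `to_gfm` are hypothetical, the input is assumed to be cell text already pulled out of the DOM, and escaping only the pipe character (not backslashes) is an assumption.

```python
import re
from typing import List, Optional


def clean_cell(text: str) -> str:
    """Collapse whitespace runs to one space, trim, and escape pipes."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.replace("|", "\\|")


def to_gfm(rows: List[List[str]], headers: Optional[List[str]] = None) -> str:
    """Render rows as a GFM table (hypothetical helper).

    If no headers are given, synthetic col1..colN headers are generated.
    Short rows are padded with empty cells; nothing is invented.
    """
    width = max([len(r) for r in rows] + [len(headers or [])])
    if headers is None:
        headers = [f"col{i}" for i in range(1, width + 1)]

    def line(cells: List[str]) -> str:
        padded = list(cells) + [""] * (width - len(cells))
        return "| " + " | ".join(clean_cell(c) for c in padded) + " |"

    out = [line(headers), "| " + " | ".join(["---"] * width) + " |"]
    out.extend(line(r) for r in rows)
    return "\n".join(out)


if __name__ == "__main__":
    print(to_gfm([["Name | Role", "  Ada \n Lovelace "], ["solo"]]))
```

Calling `to_gfm` without headers on a two-column table yields a `col1`/`col2` header row, and the second row's missing cell is left empty rather than filled.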

The conversion is fully deterministic: the same HTML in always yields the same Markdown and the same hash out. Nothing is summarized, rewritten, or hallucinated; missing cells are emitted as empty, never invented.
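As a rough sketch of how such a hash could be reproduced on the consumer side: the function name `content_hash` is hypothetical, and hashing the UTF-8 bytes of the GFM text with no further normalization is an assumption; the source only states that the hash is `sha256:` plus 64 hex characters computed over the GFM text.

```python
import hashlib


def content_hash(gfm_table: str) -> str:
    """Return 'sha256:' + 64 hex chars over the GFM text.

    Assumption: the text is hashed as UTF-8 bytes, unmodified.
    """
    digest = hashlib.sha256(gfm_table.encode("utf-8")).hexdigest()
    return "sha256:" + digest


if __name__ == "__main__":
    table = "| col1 | col2 |\n| --- | --- |\n| a | b |"
    # Identical text always produces the identical hash.
    print(content_hash(table) == content_hash(table))
```

Because the hash depends only on the Markdown text, two runs over unchanged HTML should yield matching hashes, which is what makes dedup and change detection possible.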

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| urls | array of strings | yes | Page URLs whose tables you are authorized to extract (your own, authorized, or public pages). |
| ownership_attestation | boolean | yes | Must be true to confirm you own or are authorized to extract from these pages. If false/omitted, the run is rejected before any work and bills $0. |
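A run input matching these two fields might look like the dictionary below, with a small pre-flight check you could run before starting the Actor. The helper `validate_run_input` and the example URL are hypothetical, and treating an empty `urls` list as invalid is an assumption; the Actor's own validation may differ.

```python
from typing import Any, Dict, List


def validate_run_input(run_input: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    errors: List[str] = []
    urls = run_input.get("urls")
    if (
        not isinstance(urls, list)
        or not urls  # assumption: at least one URL is expected
        or not all(isinstance(u, str) for u in urls)
    ):
        errors.append("urls must be a non-empty array of strings")
    # Must be exactly true; false or omitted means the run is rejected.
    if run_input.get("ownership_attestation") is not True:
        errors.append("ownership_attestation must be true")
    return errors


# Placeholder URL for illustration only.
example_input = {
    "urls": ["https://example.com/docs/pricing"],
    "ownership_attestation": True,
}

if __name__ == "__main__":
    print(validate_run_input(example_input))
```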

Output

Records are pushed to the dataset. There are two record_type values:

table (one record per extracted table):

| Field | Type | Description |
| --- | --- | --- |
| record_type | string | "table" |
| source_url | string | The page the table came from. |
| table_index | integer | Zero-based index of the table within the page. |
| gfm_table | string | The full GitHub-Flavored Markdown table. |
| column_headers | string[] | The header cells (real or synthesized col1..colN). |
| row_count | integer | Number of body rows (excluding the header). |
| content_hash | string | Deterministic sha256:<64 hex> over the GFM text. |

run_summary (one record per run):

| Field | Type | Description |
| --- | --- | --- |
| record_type | string | "run_summary" |
| pages_processed | integer | Pages successfully fetched and scanned. |
| tables_extracted | integer | Total tables converted across all pages. |

A page that fails to fetch is skipped entirely (it contributes no records and is not counted). A page that fetches successfully but contains no tables is counted in pages_processed but adds no table records. Either way, pages with zero tables are never billed.
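On the consuming side, the two record types can be separated by `record_type`, and repeated tables dropped by `content_hash`. The sketch below assumes the dataset items have already been loaded as a list of dictionaries with the fields described above; `dedup_tables` is a hypothetical helper and the sample hashes are placeholders, not real output.

```python
from typing import Any, Dict, Iterable, List


def dedup_tables(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first table record seen for each content_hash.

    Non-table records (such as run_summary) are ignored.
    """
    seen = set()
    unique: List[Dict[str, Any]] = []
    for rec in records:
        if rec.get("record_type") != "table":
            continue
        h = rec["content_hash"]
        if h in seen:
            continue
        seen.add(h)
        unique.append(rec)
    return unique


# Placeholder records for illustration; hashes are fake.
sample = [
    {"record_type": "table", "source_url": "https://example.com/a",
     "table_index": 0, "content_hash": "sha256:" + "a" * 64},
    {"record_type": "table", "source_url": "https://example.com/b",
     "table_index": 0, "content_hash": "sha256:" + "a" * 64},
    {"record_type": "table", "source_url": "https://example.com/b",
     "table_index": 1, "content_hash": "sha256:" + "b" * 64},
    {"record_type": "run_summary", "pages_processed": 2, "tables_extracted": 3},
]

if __name__ == "__main__":
    for rec in dedup_tables(sample):
        print(rec["source_url"], rec["table_index"])
```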

Pricing

TableForge uses Apify Pay-Per-Event. You are billed only for:

| Event | Price (USD) | When it fires |
| --- | --- | --- |
| actor_run_start | $0.005 | Once per run, after the ownership and paid-plan gates pass. |
| table_extracted | $0.001 | Once per table successfully converted to GFM (the billed unit). |

Pages are scanned but not billed (no double-charging), and failed or empty pages cost $0.

Example run: scan 10 pages that together contain 80 tables -> $0.005 (run start) + 80 x $0.001 = $0.085 total.
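The arithmetic in the example run can be checked with a short estimator. The prices are taken from the event table above; the function name `estimate_cost` is hypothetical, and the sketch assumes the run passes the ownership and paid-plan gates (a rejected run bills $0). `Decimal` is used to avoid floating-point rounding on small currency amounts.

```python
from decimal import Decimal

RUN_START_USD = Decimal("0.005")        # actor_run_start, once per run
TABLE_EXTRACTED_USD = Decimal("0.001")  # table_extracted, once per table


def estimate_cost(tables_extracted: int) -> Decimal:
    """Estimated charge in USD for one accepted run.

    Pages are not billed, so only the table count matters.
    """
    if tables_extracted < 0:
        raise ValueError("tables_extracted must be non-negative")
    return RUN_START_USD + TABLE_EXTRACTED_USD * tables_extracted


if __name__ == "__main__":
    print(estimate_cost(80))
```

For 80 tables this gives 0.005 + 0.080 = 0.085 USD, matching the example run; a run that finds no tables costs only the 0.005 USD start event.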

Why this Actor

  • Deterministic + idempotent. Output is a pure function of the input HTML. Every table carries an sha256: content hash over its Markdown, so you can dedup, cache, and detect changes reliably across runs.
  • No hallucination. Tables are parsed structurally from the DOM and reproduced faithfully. Missing cells are emitted empty, never fabricated; nothing is paraphrased or summarized.
  • Ownership attestation gate. A run cannot proceed unless you attest you are authorized to extract from the pages; without it, the run is rejected before any work with zero billing.
  • Embeddings/RAG-ready by design. Clean GFM with preserved (or synthesized) headers, escaped pipes, and per-table hashes drops straight into LLM prompts, vector stores, and Markdown docs.

This Actor is AI-authored and operated under the publisher's LLC. It uses Apify's Actor.charge() solely to bill the customer for the events above; the Actor contains no payout or money-out capability of any kind.
