URL: https://apify.com/wsgcjj/web-to-markdown

Web to Markdown — AI-Ready Text from Any URL

Pricing

from $5.00 / 1,000 results

Convert any web page URL to clean Markdown format. Perfect for LLM training data, RAG pipelines, and AI content processing. Extracts main content, strips ads/nav/footers.

Rating: 0.0 (0)

Developer: 陈俊杰 (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 3
  • Monthly active users: 1
  • Last modified: 14 days ago

🌐 Web to Markdown Converter — Apify Actor

Converts any web page URL into clean Markdown, designed for AI/LLM data-processing workflows.

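📋 Features

  • One-step fetch: enter a URL and the page HTML is retrieved automatically
  • Smart extraction: detects and extracts the main content (article or primary content block) and removes ads, navigation bars, footers, sidebars, and similar noise
  • Clean output: converts the HTML to standard Markdown using markdownify
  • Optional CSS selector: restrict extraction to a specific region of the page
  • Thorough error handling: HTTP errors, timeouts, and parsing exceptions are all handled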
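The extraction step in the list above can be sketched with the Python standard library alone. This is not the Actor's implementation (which, per the list, uses markdownify). The names `MainTextExtractor`, `SKIP_TAGS`, and `html_to_markdown` are hypothetical, and dropping whole `nav`/`footer`/`aside`/`header`/`script`/`style` subtrees is an assumption made for illustration. The sketch only covers noise stripping plus a minimal heading and paragraph conversion.

```python
from html.parser import HTMLParser

# Tags whose entire subtree is treated as noise (an assumption for this sketch).
SKIP_TAGS = {"nav", "footer", "aside", "header", "script", "style"}
# Markdown prefixes for the heading levels this sketch understands.
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}


class MainTextExtractor(HTMLParser):
    """Collects headings and paragraphs that sit outside noise elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped subtree
        self.prefix = None    # Markdown prefix of the block being read, or None
        self.buffer = []      # text fragments of the current block
        self.blocks = []      # finished Markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and (tag in HEADINGS or tag == "p"):
            self.prefix = HEADINGS.get(tag, "")
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and self.prefix is not None and (tag in HEADINGS or tag == "p"):
            # Collapse runs of whitespace before storing the block.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    """Return the headings and paragraphs of `html` as Markdown blocks."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


sample = """
<html><body>
<nav><p>Home | About</p></nav>
<article><h1>Title</h1><p>Hello <b>world</b>.</p></article>
<footer><p>Copyright</p></footer>
</body></html>
"""
print(html_to_markdown(sample))
```

Inline formatting such as bold text is flattened to plain text here; a real converter would also handle links, lists, images, and code blocks.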

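🎯 Use cases

| Use case | Description |
| --- | --- |
| LLM training data preparation | Convert web content into structured text for model training |
| RAG pipelines | Preprocessing step between web documents and a vector database |
| AI content processing | Feed LLM workflows such as summarization, translation, and analysis |
| Data archiving | Save online articles as readable plain text |
| Web content comparison | Extract page text from different versions for diff analysis |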
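For the web content comparison case, two Markdown snapshots of the same page can be compared with the standard library's `difflib`. This is a minimal sketch; the function name `markdown_diff` and the sample strings are made up for illustration and are not part of the Actor.

```python
import difflib


def markdown_diff(old_md: str, new_md: str) -> list[str]:
    """Return a unified diff between two Markdown snapshots of a page."""
    return list(
        difflib.unified_diff(
            old_md.splitlines(),
            new_md.splitlines(),
            fromfile="old.md",
            tofile="new.md",
            lineterm="",  # input lines carry no newline, so the headers should not either
        )
    )


old = "# Pricing\n\nBasic plan: $5"
new = "# Pricing\n\nBasic plan: $7"
for line in markdown_diff(old, new):
    print(line)
```

Because the converted Markdown has ads and navigation removed, the diff tends to show only changes in the main content.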

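📥 Input parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | | URL of the target web page |
| selector | string | No | null | CSS selector that limits extraction to one region (e.g. .article-body) |
| include_images | boolean | No | false | Whether to include image links in the Markdown |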
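The parameters above can be checked and defaulted before a run is started. The sketch below is one way to do it. The helper name `normalize_input` is hypothetical, and rejecting anything other than absolute http(s) URLs is an assumption, not documented Actor behavior.

```python
from urllib.parse import urlparse


def normalize_input(raw: dict) -> dict:
    """Validate raw Actor input and fill in the documented defaults."""
    url = raw.get("url")
    if not isinstance(url, str) or not url.strip():
        raise ValueError("'url' is required")
    url = url.strip()
    parsed = urlparse(url)
    # Assumption: only absolute http(s) URLs make sense for fetching a page.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"'url' must be an absolute http(s) URL: {url!r}")
    return {
        "url": url,
        "selector": raw.get("selector") or None,                    # default: null
        "include_images": bool(raw.get("include_images", False)),   # default: false
    }


print(normalize_input({"url": "https://example.com/post"}))
```

Validating locally avoids paying for a run that would fail on a malformed URL.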

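📤 Output fields

| Field | Type | Description |
| --- | --- | --- |
| url | string | URL of the source page |
| title | string | Page title |
| markdown | string | Converted Markdown text |
| word_count | integer | Number of words in the Markdown |
| char_count | integer | Number of characters in the Markdown |
| extracted_at | string | Extraction time (UTC, ISO 8601) |
| error | string | Error message when processing fails |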
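To make the field definitions concrete, here is a sketch of how such a record could be assembled. The function name `build_record` is hypothetical. Counting words by whitespace splitting and setting `error` to `None` on success are assumptions, since the source does not say how the Actor computes or populates them.

```python
from datetime import datetime, timezone


def build_record(url, title="", markdown="", error=None):
    """Assemble one output item with the documented fields."""
    return {
        "url": url,
        "title": title,
        "markdown": markdown,
        # Assumption: words are whitespace-separated tokens (undercounts CJK text).
        "word_count": len(markdown.split()),
        "char_count": len(markdown),
        # Timezone-aware UTC timestamp in ISO 8601 format.
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        # Assumption: None on success, a message string on failure.
        "error": error,
    }


ok = build_record("https://example.com", "Example", "# Example\n\nHello world.")
failed = build_record("https://example.com/missing", error="HTTP 404")
print(ok["word_count"], ok["char_count"], failed["error"])
```

Emitting a record even when a page fails keeps one dataset item per input URL, so downstream code can filter on `error` instead of looking for missing rows.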

🚀 Quick start

Via the Apify platform

  1. Open the Web to Markdown Converter Actor page
  2. Click Run
  3. Enter the target URL and click Start
  4. Retrieve the Markdown output

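Via the Apify API

    import requests

    # Start an Actor run; the JSON body is passed to the Actor as its input.
    response = requests.post(
        "https://api.apify.com/v2/acts/<username>~web-to-markdown/runs",
        # The Apify API requires an API token; replace the placeholder with your own.
        params={"token": "<YOUR_APIFY_TOKEN>"},
        json={
            "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
            "include_images": False,
        },
        timeout=30,
    )
    print(response.json())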

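Via the Apify SDK (Python)

    import asyncio

    from apify import Actor


    async def main():
        async with Actor:
            run_input = {
                "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
                "include_images": False,
            }
            # Call the Actor and wait for the run to finish.
            run = await Actor.call("username/web-to-markdown", run_input=run_input)
            # Read the run's default dataset (attribute names as in Apify SDK 2.x).
            dataset = await Actor.apify_client.dataset(run.default_dataset_id).list_items()
            print(dataset.items[0]["markdown"][:500])


    asyncio.run(main())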

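🛠 Local development

Prerequisites

  • Python 3.14+
  • Apify CLI (npm install -g apify-cli)

Run locally

# Install dependencies
pip install -r requirements.txt
# Run via the Apify CLI
apify run
# Or run the Python module directly
python -m src

Testing

# Set the local storage directory
export APIFY_LOCAL_STORAGE_DIR=./apify_storage
# Run the Actor
apify run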

📦 Tech stack

📄 License

MIT
