URL: https://apify.com/quaking_pail/ai-website-content-markdown-scraper

AI Website Content Markdown Scraper · Apify



Pricing

$30.00 / 1,000 results

This Apify Actor, "Website Content Crawler with Markdown Extraction," crawls the specified websites, extracts their text content, converts it to Markdown, and stores it in a structured dataset. The extracted content is suitable as input for LLMs.

Rating

2.3

(3)

Developer

AI_Builder

Maintained by Community

Actor stats

31

Bookmarked

937

Total users

6

Monthly active users

5 months ago

Last modified

📄 Apify Actor: Markdown Website Crawler

🧠 Overview

This Apify Actor crawls a website starting from a list of given URLs, performs a search using a selected search engine to find more relevant URLs within the same domain, scrapes and cleans the main content of the pages, and outputs the result in Markdown format.

It uses Selenium with a headless Chrome browser to accurately render JavaScript-heavy websites and extract readable content. Unwanted scripts, ads, headers, footers, and cookie banners are removed to ensure clean and focused output.

⚙️ Input Schema

The Actor accepts the following input fields:

| Field | Type | Description |
| --- | --- | --- |
| start_urls | Array | Array of objects with a url key. These are the starting points of the crawl. |
| max_depth | Integer | Maximum crawl depth (how far it should follow links from the start page). |
| max_urls | Integer | Maximum number of pages to scrape in total. |
| search_engine | String | (Optional) Which search engine to use to find additional URLs. One of: Google, Bing, or DuckDuckGo. Default: Google |

Example input (JSON):

    {
      "start_urls": [
        { "url": "https://apify.com" }
      ],
      "max_depth": 1,
      "max_urls": 10,
      "search_engine": "Google"
    }

📤 Output Format

Each result pushed to the dataset contains:

| Field | Type | Description |
| --- | --- | --- |
| url | String | The URL of the scraped page. |
| title | String | The page's title (as seen in the browser tab). |
| content | String | The cleaned Markdown version of the main page content. |
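As an illustration of the input schema, the following minimal sketch builds and checks an input object before a run is started. The helper name `build_run_input` and the rule that both integers must be non-negative are assumptions made for this example; the listing does not state how the Actor itself validates its input.

```python
from typing import Any

# Search engines named in the input schema above.
ALLOWED_ENGINES = ("Google", "Bing", "DuckDuckGo")


def build_run_input(
    urls: list[str],
    max_depth: int = 1,
    max_urls: int = 10,
    search_engine: str = "Google",
) -> dict[str, Any]:
    """Return an input dict shaped like the documented schema (hypothetical helper)."""
    if not urls:
        raise ValueError("at least one start URL is required")
    for name, value in (("max_depth", max_depth), ("max_urls", max_urls)):
        # Assumption: negative values are treated as invalid here.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
    if search_engine not in ALLOWED_ENGINES:
        raise ValueError(f"search_engine must be one of {ALLOWED_ENGINES}")
    return {
        # start_urls is an array of objects, each with a "url" key.
        "start_urls": [{"url": u} for u in urls],
        "max_depth": max_depth,
        "max_urls": max_urls,
        "search_engine": search_engine,
    }
```

A dict built this way has the same shape as the example input above and could be supplied as the run input when starting the Actor, for instance through the Apify API or console.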

Functionality

  1. Search Engine Discovery

Uses Google, Bing, or DuckDuckGo to search for the domain.

Extracts links that belong to the same root domain.

Adds those links to the crawl queue.
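The same-root-domain filter in this step can be sketched as follows. This is a simplified illustration, not the Actor's code: `root_domain` and `same_root_links` are hypothetical names, and the root domain is approximated as the last two host labels, which is wrong for suffixes such as `co.uk`.

```python
from urllib.parse import urlparse


def root_domain(url: str) -> str:
    """Approximate the root domain as the last two labels of the host name."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def same_root_links(start_url: str, candidates: list[str]) -> list[str]:
    """Keep http(s) links sharing the start URL's root domain, without duplicates."""
    target = root_domain(start_url)
    seen: set[str] = set()
    kept: list[str] = []
    for link in candidates:
        if urlparse(link).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if root_domain(link) == target and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept
```

A production filter would more likely rely on the Public Suffix List to decide where the registrable domain starts.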

  2. Crawling & Scraping

Opens each valid page.

Strips unwanted elements: scripts, headers, footers, styles, iframes, videos, cookie banners.

Extracts the content of main, article, section, and div elements.

Converts the HTML to Markdown using markdownify.
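The stripping part of this step can be illustrated with the standard library alone. The sketch below is a stand-in under stated assumptions: it does not use Selenium or markdownify, it returns plain text rather than Markdown, and the class name `MainTextExtractor` and the `SKIP_TAGS` set are made up for the example.

```python
from html.parser import HTMLParser

# Elements whose content is dropped, following the list in the step above.
SKIP_TAGS = {"script", "style", "header", "footer", "iframe", "video"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the skipped elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the remaining text, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

In the Actor itself, the cleaned HTML is then converted to Markdown by markdownify, a step this sketch leaves out.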

  3. Cleaning Markdown

Removes broken or irrelevant Markdown syntax.

Filters out image tags, inline SVGs, tracking text, and known cookie policy messages.

Trims and normalizes white space.
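A cleaning pass of this kind might look like the sketch below. The regular expressions and the `COOKIE_HINTS` phrases are assumptions chosen for illustration; the Actor's actual patterns are not published in this listing.

```python
import re

# Markdown image syntax: ![alt](src)
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Links with no visible text, e.g. [](https://example.com)
EMPTY_LINK_RE = re.compile(r"\[\s*\]\([^)]*\)")
# Hypothetical phrases marking cookie-banner lines.
COOKIE_HINTS = ("we use cookies", "accept all cookies")


def clean_markdown(md: str) -> str:
    """Drop images, empty links, and cookie notices; normalize whitespace."""
    md = IMAGE_RE.sub("", md)
    md = EMPTY_LINK_RE.sub("", md)
    lines = []
    for line in md.splitlines():
        # Collapse runs of spaces and tabs, then trim the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if any(hint in line.lower() for hint in COOKIE_HINTS):
            continue
        lines.append(line)
    # Reduce three or more consecutive newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return text.strip()
```

Collapsing whitespace this aggressively also flattens indentation, so nested lists and indented code would need separate handling.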

🛑 Limitations

The scraper is designed to stay within the same root domain as the starting URL.

JavaScript-heavy pages may still fail if they block bots or detect automation.

Search engine discovery depends on the HTML structure of the search engines' result pages and may break when that structure changes.

🧪 Development Notes

Browser automation is powered by Selenium and ChromeDriver.

Designed for use in Apify's headless actor environment with Chromium.

Requests are tracked using Apify's RequestQueue with deduplication.
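How deduplication interacts with max_depth and max_urls can be sketched with an in-memory queue. This is only a stand-in for Apify's RequestQueue: `crawl_order` and the `get_links` callback are hypothetical, and treating start URLs as depth 0 is an assumption about how the Actor counts depth.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_order(
    start_urls: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_depth: int,
    max_urls: int,
) -> list[str]:
    """Return URLs in breadth-first visit order, each visited at most once."""
    queue: deque[tuple[str, int]] = deque()
    seen: set[str] = set()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))  # assumption: start pages are depth 0
    visited: list[str] = []
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:  # deduplicate before enqueueing
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

For example, with a small link graph held in a dict, `crawl_order(["a"], lambda u: graph.get(u, []), max_depth=1, max_urls=10)` visits the start page and its direct links only.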

🧼 Cleanup

The browser is closed gracefully at the end of the run via driver.quit().

Requests are marked as handled after processing.

🚀 Usage

This Actor is ideal for:

Archiving or monitoring content changes.

SEO content extraction.

Research on company websites or competitor analysis.

You might also like

Tayara.tn vehicule scraper

gitgudless/tayara-scraper

Automatically scrape used car listings across Tunisia for market research, price tracking, and lead generation

AdouWe

2

AI Markdown Maker

onescales/bulk-ai-markdown-maker

Convert any web page into clean, AI ready markdown format in seconds. This markdown generator is perfect for content for AI models, creating documentation, or archiving web content. It intelligently parses web content, removing ads, navigation, and other clutter. Generate Markdown Today!

133

5.0

Website To Markdown

swarmgarden/website-to-markdown

Convert any webpage to clean, readable Markdown format. Perfect for content extraction and readability.

70

URL to markdown

apify/url-to-markdown

An Apify Actor that takes a URL as input and returns the content of the page in Markdown format.

Website Content Crawler API - Markdown for RAG

tugelbay/website-content-crawler

Crawl public websites and extract clean Markdown, text, or HTML for RAG pipelines, AI agents, documentation indexing, and content monitoring. Guide: https://konabayev.com/tools/website-content-crawler/?utm_source=apify_info&utm_medium=referral&utm_campaign=website-content-crawler

Tugelbay Konabayev

26

Markdown Maker: HTML to Markdown

shahidirfan/Markdown-Maker

Instantly convert complex HTML into clean, structured Markdown. This lightweight actor is optimized to render web content into a format that is easily readable for AI LLMs, reducing token usage and improving context. Perfect for RAG pipelines and preparing data for training.

Website Content to Markdown for LLM Training

easyapi/website-content-to-markdown-for-llm-training

🚀 Transform web content into clean, LLM-ready Markdown! 📘 Scrape multiple pages, extract main content, and convert to Markdown format. Perfect for AI researchers, data scientists, and LLM developers. Fast, efficient, and customizable. Supercharge your AI training data today!

Webpage to Markdown

extremescrapes/webpage-to-markdown

This actor cost-effectively converts websites into structured markdown optimized for AI processing. It extracts webpage content, formats it into clean markdown, and ensures compatibility with AI models.

Extreme Scrapes

212

5.0

Website To Markdown

smart_api/website-to-markdown

Convert any webpage into clean, LLM-ready Markdown in seconds: perfect for AI training data, RAG pipelines, and content archiving.