URL: https://glama.ai/mcp/servers/search/web-scraping-tools-and-techniques

Web scraping tools and techniques | Glama


Search for: Web scraping tools and techniques

  • Why this server?

    This server is a strong fit: its primary function is to 'scrape and extract data from any website' globally, and it explicitly covers bypassing anti-bot systems and rendering JavaScript content, which directly addresses the user's need for web scraping.

    License: - | Quality: - | Maintenance: -
    Enables AI models to scrape and extract data from any website globally using Thordata's 195+ country proxy network. Bypasses anti-bot systems and renders JavaScript content, outputting structured data in Markdown, HTML, or Links format.
    Last updated
  • Why this server?

    This tool explicitly enables 'scraping and extraction' of data from websites, covering single-page scraping and multi-page crawling with rendering capabilities, making it a strong match for web scraping needs.

    License: A | Quality: - | Maintenance: C
    Enables web scraping and crawling capabilities for LLM clients, supporting single-page scraping, multi-page website crawling, and web search with multiple engines (Playwright, Cheerio, Puppeteer) and flexible output formats including markdown, HTML, text, and screenshots.
    Last updated
    11
    6
    MIT
  • Why this server?

    This server focuses on 'browser automation and web content extraction' using Playwright, a core technology for performing reliable web scraping tasks.

    License: F | Quality: - | Maintenance: D
    Enables browser automation, web content extraction, and LLM-powered data transformation using Playwright. Supports session management, authentication flows, and works with local LLMs (Ollama, JAN AI) or external providers to clean and structure extracted web data.
    Last updated
    55
    6
  • Why this server?

    This server uses 'Tavily's Search and Crawl APIs to gather and structure data,' which aligns directly with the goal of web crawling and information extraction (web scraping).

    License: - | Quality: B | Maintenance: -
    A Model Context Protocol compliant server that facilitates comprehensive web research by utilizing Tavily's Search and Crawl APIs to gather and structure data for high-quality markdown document creation.
    Last updated
    1
    57
    12
  • Why this server?

    This production-ready server provides AI-powered 'web scraping capabilities,' transforming webpages to Markdown and extracting structured data, which is highly relevant to the search query.

  • Why this server?

    This server specializes in extracting and transforming 'webpage content into clean, LLM-optimized Markdown,' a crucial step in preparing scraped data for analysis.

    License: A | Quality: A | Maintenance: D
    Extracts and transforms webpage content into clean, LLM-optimized Markdown. Returns article title, main content, excerpt, byline and site name. Uses Mozilla's Readability algorithm to remove ads, navigation, footers and non-essential elements while preserving the core content structure.
    Last updated
    1
    36
    17
    MIT
  • Why this server?

    This server enables 'reverse engineering of web applications' and automated web interactions through browser automation, both advanced techniques for in-depth web data harvesting.

    License: A | Quality: A | Maintenance: D
    Enables reverse engineering of web applications and chat interfaces through browser automation, network traffic capture, and streaming API discovery. Provides comprehensive tools for analyzing network patterns, capturing streaming responses, and automating complex web interactions.
    Last updated
    14
    2
    1
    ISC
  • Why this server?

    This server enables LLMs to perform 'browser automation and web page interactions' using Playwright, a tool frequently used for web scraping and data extraction from dynamic sites.

    License: A | Quality: - | Maintenance: D
    Enables LLMs to perform browser automation and web page interactions using Playwright's accessibility tree instead of screenshots. Provides fast, deterministic web automation through structured data without requiring vision models.
    Last updated
    5,659,017
    Apache 2.0
  • Why this server?

    This is a general-purpose tool for 'fetching content from URLs' (HTML, JSON, text), providing the basic functionality needed for web data retrieval.

    License: A | Quality: A | Maintenance: D
    A Model Context Protocol (MCP) server that enables Claude or other LLMs to fetch content from URLs, supporting HTML, JSON, text, and images with configurable request parameters.
    Last updated
    3
    3
    MIT