URL: https://apify.com/apify/beautifulsoup-scraper

BeautifulSoup Scraper · Apify


Pricing

Pay per usage

BeautifulSoup Scraper

Crawls websites using raw HTTP requests. It parses the HTML with the BeautifulSoup library and extracts data from the pages using Python code. Supports both recursive crawling and lists of URLs. This Actor is a Python alternative to Cheerio Scraper.

Rating: 5.0 (6)

Developer: Apify (maintained by Apify)

Actor stats

Bookmarked: 11
Total users: 1K
Monthly active users: 23
Last modified: 23 days ago

BeautifulSoup Scraper crawls websites using plain HTTP requests (no browser) and lets you extract data from each page with your own Python code, powered by the BeautifulSoup library. It's the Python alternative to Cheerio Scraper and is ideal for sites that don't rely on client-side JavaScript.

How it works

You give the scraper two things: where to start and how to extract data.

  1. It adds your Start URLs to the crawling queue.
  2. It fetches each URL and builds a BeautifulSoup DOM from the HTML.
  3. It runs your Page function on the page and stores the returned data.
  4. Optionally, it follows links matching your Link selector / Link patterns and enqueues them for recursive crawling.
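
The four steps above can be sketched as a small queue-driven loop. This is only an illustration of the flow, not the Actor's implementation: the in-memory PAGES dictionary stands in for real HTTP responses, the standard-library html.parser stands in for BeautifulSoup, and the glob-style link_pattern is an assumed simplification of the Link selector / Link patterns settings.

```python
from collections import deque
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website" standing in for real HTTP responses.
PAGES = {
    'https://example.com/': '<title>Home</title><a href="/a">A</a><a href="/skip">S</a>',
    'https://example.com/a': '<title>Page A</title><a href="/">Home</a>',
    'https://example.com/skip': '<title>Skipped</title>',
}


class PageParser(HTMLParser):
    """Collects the <title> text and every <a href> value from one page."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or '') + data


def crawl(start_urls, link_pattern=None):
    queue = deque(start_urls)  # step 1: seed the queue with the start URLs
    seen = set(start_urls)
    results = []
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)  # step 2: stand-in for the HTTP fetch
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        # Step 3: what a page function would return for this page.
        results.append({'url': url, 'title': parser.title})
        if link_pattern is None:
            continue  # no recursive crawling requested
        for href in parser.links:  # step 4: enqueue matching, unseen links
            absolute = urljoin(url, href)
            if fnmatchcase(absolute, link_pattern) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return results
```

Calling crawl(['https://example.com/'], 'https://example.com/a*') visits the home page and then /a, while /skip is never enqueued because it does not match the pattern.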

Page function

Python code run for every page. It receives a BeautifulSoupCrawlingContext and returns the data to store:

from typing import Any

from crawlee.crawlers import BeautifulSoupCrawlingContext


def page_function(context: BeautifulSoupCrawlingContext) -> Any:
    # Return the page URL and the <title> text (None if the page has no title).
    return {
        'url': context.request.url,
        'title': context.soup.title.string if context.soup.title else None,
    }

The code runs on Python 3.14 and may only import modules already installed in the Actor.
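
As a slightly richer variation on the page function above, the sketch below also collects the page's h1 headings and counts its links, using the standard BeautifulSoup methods find_all and get_text. The extra output fields (h1, link_count) are illustrative choices, not something the Actor requires. The import is placed under TYPE_CHECKING only so the snippet can also be read and run outside the Actor.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Needed only by type checkers; the Actor environment provides crawlee.
    from crawlee.crawlers import BeautifulSoupCrawlingContext


def page_function(context: 'BeautifulSoupCrawlingContext') -> Any:
    soup = context.soup
    # Text of every <h1> element, with surrounding whitespace stripped.
    headings = [h.get_text(strip=True) for h in soup.find_all('h1')]
    # href values of all <a> elements that actually have an href attribute.
    links = [a['href'] for a in soup.find_all('a', href=True)]
    return {
        'url': context.request.url,
        'title': soup.title.string if soup.title else None,
        'h1': headings,
        'link_count': len(links),
    }
```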

Proxy configuration

A proxy is required. Set proxyConfiguration to use Apify Proxy (automatic or selected groups) or your own custom proxy URLs:

{
    "useApifyProxy": true,   // use Apify Proxy
    "apifyProxyGroups": [],  // optional: specific groups
    "proxyUrls": []          // or custom "scheme://user:pass@host:port" URLs
}
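
If you assemble the Actor input in Python, the variants described above might look like the following. This is a sketch under assumptions: the group name RESIDENTIAL and the proxy host proxy.example.com are placeholders, and setting useApifyProxy to false alongside proxyUrls is assumed to be how custom proxies are selected.

```python
import json

# Option A: Apify Proxy with automatic group selection.
apify_proxy_auto = {'useApifyProxy': True}

# Option B: Apify Proxy limited to selected groups.
# 'RESIDENTIAL' is only a placeholder group name for illustration.
apify_proxy_groups = {'useApifyProxy': True, 'apifyProxyGroups': ['RESIDENTIAL']}

# Option C: your own proxies (assumed shape; host and credentials are made up).
custom_proxies = {
    'useApifyProxy': False,
    'proxyUrls': ['http://user:pass@proxy.example.com:8000'],
}

# The chosen variant goes under the proxyConfiguration key of the input.
run_input_fragment = {'proxyConfiguration': apify_proxy_groups}
print(json.dumps(run_input_fragment, indent=2))
```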

Output

Results returned by your page function land in the run's default dataset. Download them as JSON, CSV, XML, or Excel from Apify Console, or via the API:

https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json&clean=true
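
A minimal sketch of calling that endpoint from Python with only the standard library is shown below. The helper names build_items_url and fetch_items are hypothetical, and authentication (if your dataset requires it) is not covered here.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API_BASE = 'https://api.apify.com/v2'


def build_items_url(dataset_id, fmt='json', clean=True):
    """Build the dataset items URL in the shape shown above."""
    query = urlencode({'format': fmt, 'clean': 'true' if clean else 'false'})
    return f'{API_BASE}/datasets/{quote(dataset_id, safe="")}/items?{query}'


def fetch_items(dataset_id):
    """Download the dataset items as JSON and return the parsed result."""
    with urlopen(build_items_url(dataset_id), timeout=30) as response:
        return json.load(response)
```

For example, fetch_items('YOUR_DATASET_ID') would request the items of the dataset whose ID you substitute for the placeholder.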

Limitations

The Actor uses raw HTTP requests, so it can't render JavaScript. For dynamic sites use Web Scraper instead. To add Python modules not bundled here, open an issue or PR at github.com/apify/actor-beautifulsoup-scraper.

You might also like

Cheerio Scraper

apify/cheerio-scraper

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This Actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.

Puppeteer Scraper

apify/puppeteer-scraper

Crawls websites with headless Chrome and the Puppeteer library using provided server-side Node.js code. This crawler is an alternative to apify/web-scraper that gives you finer control over the process. Supports both recursive crawling and lists of URLs. Supports logging in to websites.

Web Scraper

apify/web-scraper

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

Playwright Scraper

apify/playwright-scraper

Crawls websites with a headless Chromium, Chrome, or Firefox browser and the Playwright library using provided server-side Node.js code. Supports both recursive crawling and lists of URLs. Supports logging in to websites.

Getting started with Python and BeautifulSoup

omnipotent_recorder/namma-seo-auditor

Scrapes titles of websites using BeautifulSoup.

๐Ÿ‘ User avatar

Slam Book Cinema

1

Vanilla JS Scraper

mstephen190/vanilla-js-scraper

Scrape the web using familiar JavaScript methods! Crawls websites using raw HTTP requests, parses the HTML with the JSDOM package, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This Actor is a non-jQuery alternative to Cheerio Scraper.

๐Ÿ‘ User avatar

Matthias Stephens

522

RAG Web Browser

apify/rag-web-browser

Web search and fetch tool for AI agents and RAG pipelines. It queries Google Search, scrapes the top N pages using a full web browser, and returns their content as clean Markdown for further processing by an LLM. Can also fetch individual URLs.

Website Content to Markdown for LLM Training

easyapi/website-content-to-markdown-for-llm-training

Transform web content into clean, LLM-ready Markdown! Scrape multiple pages, extract main content, and convert to Markdown format. Perfect for AI researchers, data scientists, and LLM developers. Fast, efficient, and customizable. Supercharge your AI training data today!

Python BeautifulSoup template

ellustar/my-actor-5

Python BeautifulSoup Actor Template: Streamline web scraping with this ready-to-use Python template. Effortlessly extract, parse, and manage data from websites using BeautifulSoup, with clean code, reusable functions, and flexible structure for fast, efficient automation projects.

TrustMRR Startup scraper

advantageous_subcontra/trustmrr

Get all startups listed in any category on TrustMRR startup database. Get all information about each startup, like revenue, founding year, and location.

66

Related articles

Python web scraping tutorial (Step-by-step guide)
Web scraping with Python Requests
Firecrawl vs. BeautifulSoup: Which is better for web scraping?