BeautifulSoup Scraper

Pricing

Pay per usage

Crawls websites using raw HTTP requests. It parses the HTML with the BeautifulSoup library and extracts data from the pages using Python code. Supports both recursive crawling and lists of URLs. This Actor is a Python alternative to Cheerio Scraper.

Rating

5.0

(6)

Developer

Apify

Maintained by Apify

Last modified

23 days ago

How it works

You give the scraper two things: where to start and how to extract data.

1. It adds your Start URLs to the crawling queue.
2. It fetches each URL and builds a BeautifulSoup DOM from the HTML.
3. It runs your Page function on the page and stores the returned data.
4. Optionally, it follows links matching your Link selector / Link patterns and enqueues them for recursive crawling.
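
The queue logic in the steps above can be modelled in a few lines. This is only a toy sketch, not the Actor's implementation: it assumes a hypothetical in-memory site (FAKE_SITE) instead of real HTTP requests, uses the standard library's HTMLParser as a stand-in for BeautifulSoup, and treats the link pattern as a plain regular expression. The names crawl and PageParser are made up for illustration.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website" so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/": '<title>Home</title><a href="/docs/a">A</a><a href="/blog/x">X</a>',
    "https://example.com/docs/a": '<title>Doc A</title><a href="/docs/b">B</a><a href="/">Home</a>',
    "https://example.com/docs/b": "<title>Doc B</title>",
    "https://example.com/blog/x": "<title>Blog X</title>",
}


class PageParser(HTMLParser):
    """Collects the <title> text and every <a href> (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = data


def crawl(start_urls, link_pattern, page_function):
    queue = deque(start_urls)          # step 1: seed the queue with the start URLs
    seen = set(start_urls)             # never enqueue the same URL twice
    results = []
    while queue:
        url = queue.popleft()
        html = FAKE_SITE.get(url)      # step 2: "fetch" the page
        if html is None:
            continue
        page = PageParser()
        page.feed(html)
        page.close()
        results.append(page_function(url, page))   # step 3: run the page function
        for href in page.links:                    # step 4: follow matching links
            absolute = urljoin(url, href)
            if absolute not in seen and re.search(link_pattern, absolute):
                seen.add(absolute)
                queue.append(absolute)
    return results


results = crawl(
    ["https://example.com/"],
    r"/docs/",
    lambda url, page: {"url": url, "title": page.title},
)
```

With the pattern /docs/, the blog page is never visited because its URL does not match, while the two docs pages are reached recursively from the start URL.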

Page function

Python code run for every page. It receives a BeautifulSoupCrawlingContext and returns the data to store:

from typing import Any

from crawlee.crawlers import BeautifulSoupCrawlingContext


def page_function(context: BeautifulSoupCrawlingContext) -> Any:
    return {
        'url': context.request.url,
        'title': context.soup.title.string if context.soup.title else None,
    }

The code runs on Python 3.14 and may only import modules already installed in the Actor.
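
A page function can return more than the title. The sketch below is one possible variant, not an official example: the headings and link_count fields are hypothetical choices made for illustration, and the crawlee import is placed behind TYPE_CHECKING only so the snippet also loads where crawlee is not installed. It relies solely on context.request.url and context.soup from the example above, plus the standard BeautifulSoup methods find_all and get_text.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Imported for type checkers only; the sketch runs without crawlee installed.
    from crawlee.crawlers import BeautifulSoupCrawlingContext


def page_function(context: 'BeautifulSoupCrawlingContext') -> Any:
    soup = context.soup
    title_tag = soup.title  # None when the page has no <title>
    return {
        'url': context.request.url,
        'title': title_tag.get_text(strip=True) if title_tag is not None else None,
        # Hypothetical extra fields, chosen for illustration:
        'headings': [h.get_text(strip=True) for h in soup.find_all('h2')],
        'link_count': len(soup.find_all('a', href=True)),
    }
```

Returning a flat dictionary of strings, numbers, and lists keeps each record easy to export to the tabular formats mentioned in the Output section.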

Proxy configuration

A proxy is required. Set proxyConfiguration to use Apify Proxy (automatic or selected groups) or your own custom proxy URLs:

{
 "useApifyProxy": true, // use Apify Proxy
 "apifyProxyGroups": [], // optional: specific groups
 "proxyUrls": [] // or custom "scheme://user:pass@host:port" URLs
}
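
When the input is assembled programmatically, a small helper can build this object and catch malformed custom proxy URLs early. This is a minimal sketch under assumptions: build_proxy_configuration is a hypothetical helper name, the check only verifies that a scheme, host, and port are present, and setting useApifyProxy to false alongside proxyUrls is assumed to be how custom proxies are selected.

```python
from urllib.parse import urlsplit


def build_proxy_configuration(proxy_urls=None, groups=None):
    """Return a proxyConfiguration dict using the field names shown above."""
    if proxy_urls:
        for url in proxy_urls:
            parts = urlsplit(url)
            # Require at least scheme://host:port; credentials are optional here.
            if not (parts.scheme and parts.hostname and parts.port):
                raise ValueError(f'Expected scheme://user:pass@host:port, got: {url!r}')
        # Assumption: custom proxy URLs are used instead of Apify Proxy.
        return {'useApifyProxy': False, 'proxyUrls': list(proxy_urls)}
    config = {'useApifyProxy': True}
    if groups:
        config['apifyProxyGroups'] = list(groups)
    return config


automatic = build_proxy_configuration()
custom = build_proxy_configuration(
    proxy_urls=['http://user:pass@proxy.example.com:8000'],  # placeholder URL
)
```

The resulting dictionary would be placed under the proxyConfiguration key of the Actor input.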

Output

Results returned by your page function land in the run's default dataset. Download them as JSON, CSV, XML, or Excel from Apify Console, or via the API:

https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json&clean=true
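
The same URL can be built and fetched from Python with the standard library alone. This is a sketch rather than a full client: dataset_items_url and fetch_items are hypothetical helper names, the dataset ID is a placeholder you would replace with your own, and depending on the dataset's access settings the request may also need an API token, which is not handled here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def dataset_items_url(dataset_id, fmt='json', clean=True):
    """Build the items URL shown above for a given dataset ID."""
    query = urlencode({'format': fmt, 'clean': 'true' if clean else 'false'})
    return f'https://api.apify.com/v2/datasets/{dataset_id}/items?{query}'


def fetch_items(dataset_id):
    """Download the dataset items as parsed JSON (performs a network request)."""
    with urlopen(dataset_items_url(dataset_id), timeout=30) as response:
        return json.load(response)


# Building the URL needs no network access; 'DATASET_ID' is a placeholder.
url = dataset_items_url('DATASET_ID')
```

Passing fmt='csv' or fmt='xml' to dataset_items_url changes only the format query parameter; fetch_items as written expects JSON.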

Limitations

The Actor uses raw HTTP requests, so it can't render JavaScript. For dynamic sites use Web Scraper instead. To add Python modules not bundled here, open an issue or PR at github.com/apify/actor-beautifulsoup-scraper.

URL: https://apify.com/apify/beautifulsoup-scraper