
URL: https://tech-insider.org/python-web-scraping-tutorial-beautifulsoup-playwright-2026/

Build a Python Web Scraper in 5 Steps [2026]


March 29, 2026
30 min read

Web scraping with Python remains one of the most in-demand skills for developers, data scientists, and business analysts in 2026. Whether you need to monitor competitor pricing, aggregate news feeds, or build datasets for machine learning, knowing how to extract data from the web programmatically is essential. This Python web scraping tutorial walks you through building a production-ready scraper from scratch, covering everything from basic HTTP requests to handling JavaScript-rendered pages with Playwright.

By the end of this tutorial, you will have a fully working Python web scraper that can extract structured data from both static and dynamic websites, handle anti-bot protections, store results in multiple formats, and run on a schedule. We will use the latest versions of BeautifulSoup, Requests, Playwright, and Scrapy available as of March 2026.

Prerequisites and Environment Setup

Before diving into web scraping with Python, you need a properly configured development environment. This section covers every tool, library, and version you will need throughout this tutorial. Getting the prerequisites right from the start prevents frustrating debugging sessions later when you encounter version incompatibilities or missing dependencies.

Python 3.11 or later is required for this tutorial. Python 3.12 and 3.13 are fully supported and recommended for their improved performance, particularly the faster startup times introduced in Python 3.12. You can verify your Python version by running python3 --version in your terminal. If you are on macOS, consider installing Python via Homebrew with a versioned formula, for example brew install python@3.12. Windows users should download the latest installer from the official Python website and ensure the "Add to PATH" checkbox is selected during installation.

We strongly recommend using a virtual environment to isolate your project dependencies. This prevents conflicts between packages required by different projects on your system. Create and activate a virtual environment with the following commands:

# Create project directory
mkdir python-web-scraper && cd python-web-scraper

# Create virtual environment
python3 -m venv venv

# Activate (macOS/Linux)
source venv/bin/activate

# Activate (Windows)
# venv\Scripts\activate

# Verify activation
which python
# Should show: /path/to/python-web-scraper/venv/bin/python

Here is the complete list of dependencies and their versions used in this tutorial:

Package | Version | Purpose
Python | 3.12+ | Runtime environment
requests | 2.32+ | HTTP client for static pages
beautifulsoup4 | 4.12+ | HTML parsing and extraction
lxml | 5.3+ | Fast XML/HTML parser backend
playwright | 1.49+ | Browser automation for dynamic sites
scrapy | 2.12+ | Full-featured scraping framework
httpx | 0.28+ | Async HTTP client
pandas | 2.2+ | Data export and manipulation
fake-useragent | 2.0+ | Random User-Agent rotation

Install all required packages with a single command:

# Install all dependencies
pip install requests beautifulsoup4 lxml playwright httpx scrapy pandas fake-useragent

# Install Playwright browsers (Chromium, Firefox, WebKit)
playwright install

# Verify installations
python3 -c "import requests, bs4, lxml, playwright, httpx, scrapy, pandas; print('All packages installed successfully')"

You will also need a code editor such as VS Code with the Python extension, a terminal, and a stable internet connection. For the advanced sections, Docker is optional but recommended if you plan to deploy your scraper to a server. Make sure pip is up to date by running pip install --upgrade pip before installing the dependencies.

Step 1: Understanding How Web Scraping Works

Before writing any code, it is critical to understand the mechanics of web scraping with Python. At its core, web scraping involves three steps: sending an HTTP request to a web server, receiving the HTML response, and extracting the specific data you need from that response. This is exactly what your browser does when you visit a website, except your scraper does it programmatically without rendering the visual elements.

When you type a URL into your browser, it sends an HTTP GET request to the server. The server responds with HTML, CSS, JavaScript, and other assets. Your browser parses the HTML into a Document Object Model (DOM), executes JavaScript, applies CSS styles, and renders the page visually. A web scraper skips the rendering step entirely. It only needs the HTML content (or the data generated by JavaScript) to extract information.

There are two categories of websites you will encounter. Static websites serve fully formed HTML from the server. Every piece of data you see on the page exists in the initial HTML response. These sites are straightforward to scrape using the Requests library paired with BeautifulSoup. Dynamic websites, on the other hand, rely on JavaScript to load content after the initial HTML loads. Single-page applications built with React, Vue, or Angular fall into this category. Scraping these requires a headless browser like Playwright that can execute JavaScript and wait for content to render.

The legality of web scraping varies by jurisdiction and use case. In the United States, the Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn, issued after the Supreme Court sent the case back for reconsideration, held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act. That ruling does not settle contract, copyright, or privacy claims, so you should always check a website's Terms of Service and robots.txt file before scraping. The robots.txt file, found at the root of most domains (e.g., https://example.com/robots.txt), specifies which paths crawlers are allowed or disallowed from accessing. While robots.txt is not legally binding in all jurisdictions, respecting it is considered an ethical best practice within the scraping community.
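
The robots.txt check can be automated with the standard library. The following is a minimal sketch, assuming a hypothetical robots.txt body and a hypothetical bot name (MyScraperBot); the rules are inlined so the example runs offline, whereas a real scraper would fetch them with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "MyScraperBot"  # hypothetical User-Agent name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Paths outside /private/ are allowed for every agent
allowed = parser.can_fetch(AGENT, "https://example.com/catalogue/page-1.html")
# Paths under /private/ are disallowed
blocked = parser.can_fetch(AGENT, "https://example.com/private/report.html")
# Crawl-delay, when present, suggests a pause between requests (in seconds)
delay = parser.crawl_delay(AGENT)

print(allowed, blocked, delay)
```

Calling can_fetch() before each request, and sleeping for the reported crawl delay when one is set, keeps the scraper within the rules the site publishes.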

In 2026, websites increasingly deploy sophisticated anti-bot measures. Cloudflare, Akamai, and PerimeterX protect millions of sites with browser fingerprinting, CAPTCHAs, rate limiting, and JavaScript challenges. Understanding these defenses helps you build scrapers that work reliably without violating any terms. The key principle is to scrape responsibly: add delays between requests, identify yourself with a proper User-Agent header, and never overwhelm a server with requests.

Step 2: Your First Scraper with Requests and BeautifulSoup

Let us start with the most common Python web scraping pattern: using the Requests library to fetch HTML and BeautifulSoup to parse it. This combination handles the majority of static websites and is the foundation upon which more advanced techniques build. The Requests library provides a clean, intuitive API for making HTTP requests, while BeautifulSoup excels at navigating and searching HTML documents, even when the markup is poorly structured.

Create a new file called scraper_basic.py and add the following code:

import requests
from bs4 import BeautifulSoup
import csv
import time
from fake_useragent import UserAgent

# Initialize User-Agent rotator
ua = UserAgent()

def scrape_books(base_url: str, max_pages: int = 5) -> list[dict]:
    """Scrape book data from books.toscrape.com."""
    all_books = []

    for page in range(1, max_pages + 1):
        url = f"{base_url}/catalogue/page-{page}.html"
        headers = {
            "User-Agent": ua.random,
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
        }

        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
        except requests.RequestException as e:
            print(f"Error fetching {url}: {e}")
            continue

        soup = BeautifulSoup(response.text, "lxml")
        books = soup.select("article.product_pod")

        for book in books:
            title = book.select_one("h3 a")["title"]
            price = book.select_one(".price_color").get_text(strip=True)
            rating_class = book.select_one("p.star-rating")["class"][1]
            availability = book.select_one(".availability").get_text(strip=True)

            all_books.append({
                "title": title,
                "price": price,
                "rating": rating_class,
                "availability": availability,
            })

        print(f"Page {page}: scraped {len(books)} books")

        # Respectful delay between requests
        time.sleep(1.5)

    return all_books


def save_to_csv(data: list[dict], filename: str) -> None:
    """Save scraped data to a CSV file."""
    if not data:
        print("No data to save")
        return

    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

    print(f"Saved {len(data)} records to {filename}")


if __name__ == "__main__":
    BASE_URL = "https://books.toscrape.com"
    books = scrape_books(BASE_URL, max_pages=3)
    save_to_csv(books, "books_data.csv")
    print(f"\nTotal books scraped: {len(books)}")

Let us break down what this code does. The scrape_books function iterates through paginated pages, sending GET requests with randomized User-Agent headers to mimic real browser traffic. For each page, BeautifulSoup parses the HTML using the fast lxml parser backend. CSS selectors like article.product_pod and .price_color target specific elements. A 1.5-second delay between requests prevents overwhelming the server.

Run the scraper with python3 scraper_basic.py. You should see output similar to:

Page 1: scraped 20 books
Page 2: scraped 20 books
Page 3: scraped 20 books
Saved 60 records to books_data.csv

Total books scraped: 60

The resulting CSV file will contain columns for title, price, rating, and availability. This basic pattern (request, parse, extract, store) applies to nearly every static web scraping project. The key to making it production-ready lies in reliable error handling, proper headers, and respectful request timing.

Step 3: Advanced CSS and XPath Selectors

The effectiveness of your web scraper depends heavily on your ability to write precise selectors. BeautifulSoup supports CSS selectors natively via the .select() method, while lxml provides XPath support for more complex extraction patterns. Mastering both gives you the flexibility to handle any HTML structure you encounter during web scraping with Python.

CSS selectors work similarly to how you style elements in a stylesheet. The select() method returns a list of matching elements, while select_one() returns only the first match. Here are the most useful patterns for web scraping:

Use soup.select("div.product > h2.title") to find h2 elements with class title that are direct children of a div with class product. The > combinator ensures you only get direct children, not deeply nested matches. Use soup.select("table tr:nth-child(n+2)") to skip the header row when extracting table data. The :nth-child(n+2) pseudo-class selects all rows starting from the second one.

For attribute-based selection, soup.select('a[href*="product"]') finds all links whose href attribute contains the word "product". The *= operator performs a substring match. You can also use ^= for prefix matching and $= for suffix matching. These are invaluable when class names are dynamically generated, a common pattern in React and Angular applications.
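
The selector patterns above can be tried against a small document. This is a minimal sketch: the HTML snippet is made up for illustration, and it uses Python's built-in html.parser backend so that only beautifulsoup4 is required.

```python
from bs4 import BeautifulSoup

# Made-up markup that exercises each selector pattern
HTML = """
<div class="product">
  <h2 class="title">Widget</h2>
  <div class="extra"><h2 class="title">Nested</h2></div>
</div>
<table>
  <tr><th>Name</th></tr>
  <tr><td>Alpha</td></tr>
  <tr><td>Beta</td></tr>
</table>
<a href="/product/1">One</a>
<a href="/about">About</a>
<a href="/files/report.pdf">Report</a>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Child combinator: only the h2 directly under div.product matches
direct_titles = [h.get_text() for h in soup.select("div.product > h2.title")]

# :nth-child(n+2) skips the first row (the header)
data_rows = [row.get_text(strip=True) for row in soup.select("table tr:nth-child(n+2)")]

# Attribute operators: substring (*=), prefix (^=), suffix ($=)
product_links = [a["href"] for a in soup.select('a[href*="product"]')]
file_links = [a["href"] for a in soup.select('a[href^="/files"]')]
pdf_links = [a["href"] for a in soup.select('a[href$=".pdf"]')]

print(direct_titles, data_rows, product_links, file_links, pdf_links)
```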

XPath provides even more power when CSS selectors fall short. Using lxml directly, you can write expressions like //div[@class="content"]//p[contains(text(), "price")] to find paragraphs containing the word "price" anywhere inside a content div. XPath's following-sibling, preceding-sibling, and ancestor axes let you navigate the DOM in ways CSS selectors cannot. For instance, //h3[text()="Specifications"]/following-sibling::table[1] selects the first table after an H3 heading with the text "Specifications", a pattern you will encounter frequently on e-commerce product pages.
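
Both XPath expressions can be checked with lxml on a small fragment. The markup below is a made-up example, not taken from a real site.

```python
from lxml import html

# Made-up page fragment with a content div, a heading, and two tables
PAGE = """
<div class="content">
  <p>The price is 10.</p>
  <p>Shipping is free.</p>
  <h3>Specifications</h3>
  <p>Intro</p>
  <table><tr><td>Weight</td><td>1 kg</td></tr></table>
  <table><tr><td>Other</td></tr></table>
</div>
"""

tree = html.fromstring(PAGE)

# Paragraphs inside the content div whose text contains "price"
price_paragraphs = tree.xpath('//div[@class="content"]//p[contains(text(), "price")]')
price_texts = [p.text_content() for p in price_paragraphs]

# First table that follows the "Specifications" heading among its siblings
spec_tables = tree.xpath('//h3[text()="Specifications"]/following-sibling::table[1]')
spec_cells = [td.text_content() for td in spec_tables[0].xpath(".//td")]

print(price_texts, spec_cells)
```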

When selectors break, your scraper breaks. Websites change their HTML structure without notice, and class names generated by CSS-in-JS frameworks change with every deployment. Build resilience by using multiple fallback selectors. Try the most specific selector first, then fall back to broader patterns. You can also use the data-testid or data-* attributes that many modern frameworks add for testing purposes, as these tend to be more stable than generated class names.
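
One way to implement fallback selectors is to try an ordered list and keep the first hit. This is a minimal sketch: the helper name select_first, the selector list, and the sample markup are all hypothetical.

```python
from bs4 import BeautifulSoup


def select_first(soup, selectors):
    """Return (selector, element) for the first selector that matches, else (None, None)."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return selector, element
    return None, None


# Ordered from most stable/specific to broadest (hypothetical selectors)
PRICE_SELECTORS = [
    '[data-testid="price"]',  # test attribute, usually the most stable
    "span.price",             # hand-written class name
    '[class*="price"]',       # broad substring match as a last resort
]

# Sample markup with a generated class name and no data-testid
HTML = '<div><span class="css-1x2y3z price-tag">$12.50</span></div>'
soup = BeautifulSoup(HTML, "html.parser")

matched_selector, element = select_first(soup, PRICE_SELECTORS)
price_text = element.get_text() if element is not None else None
print(matched_selector, price_text)
```

Logging which selector matched makes it easier to notice when a site change has pushed the scraper onto its broadest fallback.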

Step 4: Scraping Dynamic JavaScript-Rendered Pages with Playwright

Many modern websites rely on JavaScript to render content. If you view the page source and the data you need is not in the HTML, you are dealing with a dynamic site. The Requests library cannot execute JavaScript, so you need a headless browser. Playwright, maintained by Microsoft with over 71,000 GitHub stars, is the best tool for this job in 2026. It supports Chromium, Firefox, and WebKit, runs in headless mode by default, and offers a Python API that is both powerful and intuitive.

Here is a complete example that scrapes dynamically loaded content using Playwright:

import asyncio
from playwright.async_api import async_playwright
import json


async def scrape_dynamic_site(url: str) -> list[dict]:
    """Scrape a JavaScript-rendered page using Playwright."""
    results = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled"]
        )
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/131.0.0.0 Safari/537.36",
            locale="en-US",
        )
        page = await context.new_page()

        # Block unnecessary resources for speed
        await page.route(
            "**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}",
            lambda route: route.abort()
        )

        await page.goto(url, wait_until="networkidle", timeout=30000)

        # Wait for specific content to load
        await page.wait_for_selector(".product-card", timeout=10000)

        # Scroll to trigger lazy-loaded content
        for _ in range(5):
            await page.evaluate("window.scrollBy(0, window.innerHeight)")
            await asyncio.sleep(0.5)

        # Extract data using page.evaluate for performance
        results = await page.evaluate("""
            () => {
                const cards = document.querySelectorAll('.product-card');
                return Array.from(cards).map(card => ({
                    name: card.querySelector('.product-name')?.textContent?.trim(),
                    price: card.querySelector('.product-price')?.textContent?.trim(),
                    link: card.querySelector('a')?.href,
                }));
            }
        """)

        await browser.close()

    return results


async def main():
    url = "https://example-spa.com/products"
    data = await scrape_dynamic_site(url)
    print(f"Scraped {len(data)} products")

    with open("products.json", "w") as f:
        json.dump(data, f, indent=2)
    print("Data saved to products.json")


if __name__ == "__main__":
    asyncio.run(main())

Several important techniques are demonstrated in this Playwright scraper. First, we block image, font, and CSS resources with page.route() to dramatically speed up page loads, since we only need the HTML content. Second, we use wait_until="networkidle" to ensure the page has finished loading API data before we start extracting. Third, we scroll the page to trigger lazy-loaded content that only appears when the user scrolls down. Fourth, we use page.evaluate() to run JavaScript directly in the browser context, which is faster than extracting elements individually from Python.

Playwright also excels at intercepting network requests. You can capture the underlying API calls that a website makes and extract structured JSON data directly, bypassing HTML parsing entirely. Add page.on("response", handle_response) to intercept API responses. This technique is often cleaner and more reliable than parsing the rendered HTML, especially for single-page applications that fetch data from REST or GraphQL endpoints.
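
A handle_response callback might look like the sketch below. It relies only on the url, headers, and json() members of Playwright's async Response object; the "/api/" URL filter and the StubResponse class are assumptions added so the handler can be exercised without launching a browser.

```python
import asyncio

captured_payloads: list[dict] = []


async def handle_response(response) -> None:
    """Store JSON bodies from responses that look like API calls."""
    # The "/api/" substring is a hypothetical filter; adjust it to the target site
    if "/api/" not in response.url:
        return
    if "application/json" not in response.headers.get("content-type", ""):
        return
    captured_payloads.append({"url": response.url, "data": await response.json()})


# In a real script, register the handler before navigating:
#     page.on("response", handle_response)
#     await page.goto(url)


class StubResponse:
    """Offline stand-in exposing only what handle_response uses."""

    def __init__(self, url: str, content_type: str, payload: dict):
        self.url = url
        self.headers = {"content-type": content_type}
        self._payload = payload

    async def json(self) -> dict:
        return self._payload


async def demo() -> None:
    await handle_response(
        StubResponse("https://example.com/api/products", "application/json", {"items": [1, 2]})
    )
    # Non-API responses are ignored
    await handle_response(StubResponse("https://example.com/styles.css", "text/css", {}))


asyncio.run(demo())
print(captured_payloads)
```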

Step 5: Handling Pagination and Infinite Scroll

Real-world web scraping projects almost always involve collecting data across multiple pages. Websites implement pagination in three common patterns: traditional page links (page 1, 2, 3), "Load More" buttons, and infinite scroll. Each requires a different approach in your Python web scraper.

For traditional pagination with numbered links, the approach is straightforward. Identify the URL pattern (often ?page=2 or /page/2/), then loop through pages until you either reach a specified maximum or detect that no more data is available. The scraper we built in Step 2 already demonstrates this pattern. The key is detecting the last page: check if the "Next" button is disabled, or if the current page returns fewer items than expected.
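
The "fewer items than expected" stop condition can be sketched independently of any HTTP library. In this illustration the fetch function is injected; fake_fetch and its 45-item catalogue are made up so the loop can run offline.

```python
from typing import Callable


def scrape_until_last_page(
    fetch_items: Callable[[int], list[dict]],
    page_size: int,
    max_pages: int = 100,
) -> list[dict]:
    """Request pages in order and stop at the first short (or empty) page."""
    collected: list[dict] = []
    for page in range(1, max_pages + 1):
        items = fetch_items(page)
        collected.extend(items)
        if len(items) < page_size:
            break  # a short page means there is nothing after it
    return collected


# Made-up data source: 45 items served 20 per page
CATALOGUE = [{"id": i} for i in range(45)]
requested_pages: list[int] = []


def fake_fetch(page: int) -> list[dict]:
    requested_pages.append(page)
    start = (page - 1) * 20
    return CATALOGUE[start:start + 20]


items = scrape_until_last_page(fake_fetch, page_size=20)
print(len(items), requested_pages)
```

In a real scraper, fetch_items would wrap the request and parsing code from Step 2; max_pages remains as a safety cap in case the site never returns a short page.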

For "Load More" buttons, you need Playwright or Selenium. The approach involves clicking the button repeatedly until it either disappears or becomes disabled. Here is a reusable pattern:

import asyncio


async def scrape_with_load_more(page, button_selector: str, max_clicks: int = 20):
    """Click a 'Load More' button until all content is loaded."""
    clicks = 0
    while clicks < max_clicks:
        try:
            button = await page.wait_for_selector(
                button_selector, timeout=5000, state="visible"
            )
            if not button:
                break

            # get_attribute("disabled") returns "" for a bare disabled attribute,
            # which is falsy, so ask Playwright for the element state instead
            if await button.is_disabled():
                break

            await button.click()
            clicks += 1

            # Wait for new content to load
            await page.wait_for_load_state("networkidle", timeout=10000)
            await asyncio.sleep(1)

            print(f"Clicked 'Load More' {clicks} times")

        except Exception:
            print("No more 'Load More' button found")
            break

    return clicks

Infinite scroll pages load new content automatically as you scroll down. The technique is similar to what we used in Step 4: programmatically scroll the page and wait for new content to appear. Track the number of items before and after each scroll. When the count stops increasing, you have reached the bottom. Set a maximum scroll count to prevent infinite loops on pages with truly endless content feeds.
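
The count-before-and-after idea can be sketched as follows. The function takes a Playwright async page and uses its locator().count(), evaluate(), and wait_for_timeout() methods; the StubPage and StubLocator classes are hypothetical stand-ins so the loop can be run without a browser.

```python
import asyncio


async def scroll_until_stable(
    page, item_selector: str, max_scrolls: int = 30, pause_ms: int = 1000
) -> int:
    """Scroll to the bottom repeatedly; stop when the item count stops growing."""
    previous = await page.locator(item_selector).count()
    for _ in range(max_scrolls):
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(pause_ms)  # give lazy-loaded items time to render
        current = await page.locator(item_selector).count()
        if current == previous:
            break  # nothing new appeared, so assume the bottom was reached
        previous = current
    return previous


class StubLocator:
    def __init__(self, page):
        self._page = page

    async def count(self) -> int:
        return self._page.items


class StubPage:
    """Offline stand-in: each scroll adds 10 items until 35 are loaded."""

    def __init__(self):
        self.items = 10

    def locator(self, selector: str) -> StubLocator:
        return StubLocator(self)

    async def evaluate(self, script: str) -> None:
        self.items = min(35, self.items + 10)

    async def wait_for_timeout(self, ms: int) -> None:
        return None


total_items = asyncio.run(scroll_until_stable(StubPage(), ".product-card", pause_ms=0))
print(total_items)
```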

A professional Python web scraper also handles pagination edge cases. Duplicate items can appear when pages overlap or when content is reordered between requests. Use a set of unique identifiers (product IDs, URLs, or content hashes) to deduplicate results. Rate limiting is equally important: adding a 1-3 second delay between page requests keeps your scraper from being blocked and shows respect for the target server's resources.
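
Deduplication with a set of identifiers might be sketched like this. The choice of the url field as the preferred identifier, with a content hash as the fallback, is an assumption made for illustration.

```python
import hashlib
import json


def item_key(item: dict) -> str:
    """Prefer the item's URL as its identity; otherwise hash the whole record."""
    if item.get("url"):
        return item["url"]
    payload = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(items: list[dict]) -> list[dict]:
    """Keep the first occurrence of each item, preserving order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for item in items:
        key = item_key(item)
        if key in seen:
            continue
        seen.add(key)
        unique.append(item)
    return unique


scraped = [
    {"url": "/book/1", "title": "A"},
    {"url": "/book/2", "title": "B"},
    {"url": "/book/1", "title": "A"},  # same product seen on an overlapping page
    {"title": "No URL"},
    {"title": "No URL"},               # identical record without a URL
]
unique_items = deduplicate(scraped)
print(len(unique_items))
```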

Step 6: Storing Scraped Data in Multiple Formats

A scraper is only as useful as the data it produces. Depending on your use case, you may need to export data as CSV, JSON, a SQLite database, or directly into a pandas DataFrame for analysis. This step shows you how to implement all four export methods in a clean, reusable way that you can plug into any web scraping Python project.

import csv
import json
import sqlite3
from pathlib import Path
import pandas as pd


class DataExporter:
    """Export scraped data in multiple formats."""

    def __init__(self, data: list[dict], project_name: str = "scrape_output"):
        self.data = data
        self.project_name = project_name
        self.output_dir = Path("output")
        self.output_dir.mkdir(exist_ok=True)

    def to_csv(self) -> str:
        filepath = self.output_dir / f"{self.project_name}.csv"
        if not self.data:
            return str(filepath)
        with open(filepath, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=self.data[0].keys())
            writer.writeheader()
            writer.writerows(self.data)
        print(f"CSV saved: {filepath} ({len(self.data)} rows)")
        return str(filepath)

    def to_json(self) -> str:
        filepath = self.output_dir / f"{self.project_name}.json"
        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(self.data, f, indent=2, ensure_ascii=False)
        print(f"JSON saved: {filepath} ({len(self.data)} records)")
        return str(filepath)

    def to_sqlite(self, table_name: str = "scraped_data") -> str:
        filepath = self.output_dir / f"{self.project_name}.db"
        df = pd.DataFrame(self.data)
        conn = sqlite3.connect(filepath)
        df.to_sql(table_name, conn, if_exists="replace", index=False)
        conn.close()
        print(f"SQLite saved: {filepath} ({len(self.data)} rows)")
        return str(filepath)

    def to_dataframe(self) -> pd.DataFrame:
        df = pd.DataFrame(self.data)
        print(f"DataFrame created: {df.shape[0]} rows x {df.shape[1]} columns")
        return df

    def export_all(self) -> dict:
        return {
            "csv": self.to_csv(),
            "json": self.to_json(),
            "sqlite": self.to_sqlite(),
        }


# Usage example
if __name__ == "__main__":
    sample_data = [
        {"title": "Python Crash Course", "price": "$29.99", "rating": "Five"},
        {"title": "Automate the Boring Stuff", "price": "$24.99", "rating": "Four"},
    ]
    exporter = DataExporter(sample_data, "books")
    paths = exporter.export_all()
    print(f"\nExported to: {paths}")

The DataExporter class encapsulates all export logic in a single reusable component. For CSV output, we use Python's built-in csv.DictWriter which handles quoting and escaping automatically. JSON export uses ensure_ascii=False to properly handle Unicode characters in international content. The SQLite export uses pandas' to_sql() method, which automatically creates the table schema based on your data's structure. This is particularly useful for large datasets where you want to run SQL queries for analysis.

For production scrapers, consider using SQLite as your primary storage and exporting to CSV or JSON on demand. SQLite requires no server setup, supports indexing for fast queries, and keeps the database consistent by serializing writes, so only one writer can modify the file at a time. If your dataset exceeds a few million rows or you need concurrent writes from multiple processes, upgrade to PostgreSQL; pandas to_sql() can write to it through a SQLAlchemy engine backed by the psycopg2 driver.
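
Using SQLite as primary storage could look like the sketch below, which uses only the standard library's sqlite3 module. The table layout, the choice of url as the primary key, and the sample rows are assumptions for illustration; an in-memory database keeps the example self-contained.

```python
import sqlite3


def save_books(conn: sqlite3.Connection, books: list[dict]) -> None:
    """Create the table and index if needed, then upsert rows keyed by URL."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS books (
            url   TEXT PRIMARY KEY,
            title TEXT NOT NULL,
            price REAL
        )
        """
    )
    # Index the column used for filtering so price queries stay fast
    conn.execute("CREATE INDEX IF NOT EXISTS idx_books_price ON books (price)")
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO books (url, title, price) VALUES (:url, :title, :price)",
            books,
        )


conn = sqlite3.connect(":memory:")  # use a file path for persistent storage
save_books(conn, [
    {"url": "/book/1", "title": "Python Crash Course", "price": 29.99},
    {"url": "/book/2", "title": "Automate the Boring Stuff", "price": 24.99},
])
# Re-scraping the same URL replaces the old row instead of duplicating it
save_books(conn, [{"url": "/book/1", "title": "Python Crash Course", "price": 19.99}])

row_count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
cheap_titles = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM books WHERE price < ? ORDER BY price", (25,)
    )
]
print(row_count, cheap_titles)
conn.close()
```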

Step 7: Building a Production Scraper with Scrapy

While Requests and BeautifulSoup are perfect for small to medium projects, Scrapy is the industry standard for large-scale web scraping in Python. With over 54,800 GitHub stars, Scrapy provides a complete framework that handles request scheduling, concurrency, middleware pipelines, auto-throttling, and data export out of the box. If you are building a scraper that needs to crawl thousands of pages or run in production, Scrapy is the right choice.

Create a new Scrapy project and spider with these commands:

# Create Scrapy project
scrapy startproject bookstore
cd bookstore

# Generate a spider
scrapy genspider books books.toscrape.com

Now edit the generated spider file at bookstore/spiders/books.py:

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose


def clean_price(value: str) -> float:
    """Convert price string to float."""
    return float(value.replace("£", "").replace("$", "").strip())


RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field(input_processor=MapCompose(clean_price))
    rating = scrapy.Field()
    rating_num = scrapy.Field()
    category = scrapy.Field()
    # Strip whitespace-only text nodes so TakeFirst returns the real text
    availability = scrapy.Field(input_processor=MapCompose(str.strip))
    url = scrapy.Field()


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    custom_settings = {
        "CONCURRENT_REQUESTS": 4,
        "DOWNLOAD_DELAY": 1.0,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 2.0,
        "FEEDS": {
            "output/books.json": {"format": "json", "overwrite": True},
            "output/books.csv": {"format": "csv", "overwrite": True},
        },
        "USER_AGENT": "TechInsiderBot/1.0 (+https://tech-insider.org)",
        "ROBOTSTXT_OBEY": True,
        "LOG_LEVEL": "INFO",
    }

    def parse(self, response):
        """Parse the main listing page."""
        for book in response.css("article.product_pod"):
            loader = ItemLoader(item=BookItem(), selector=book)
            loader.default_output_processor = TakeFirst()

            loader.add_css("title", "h3 a::attr(title)")
            loader.add_css("price", ".price_color::text")
            rating_class = book.css("p.star-rating::attr(class)").get()
            rating_word = rating_class.split()[-1] if rating_class else "Zero"
            loader.add_value("rating", rating_word)
            loader.add_value("rating_num", RATING_MAP.get(rating_word, 0))
            loader.add_css("availability", ".availability::text")
            loader.add_css("url", "h3 a::attr(href)")

            yield loader.load_item()

        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run the spider with scrapy crawl books. Scrapy automatically handles pagination by following the "next" link, respects robots.txt, throttles requests to avoid overloading the server, and exports data to both JSON and CSV simultaneously. The ItemLoader with input processors cleans the data as it is extracted, converting price strings to floats automatically.

Scrapy's architecture is built around middleware pipelines. You can add custom middleware to rotate proxies, retry failed requests, handle CAPTCHAs, or feed data into a database. The AUTOTHROTTLE feature dynamically adjusts request speed based on server response times, making your scraper both efficient and polite. For projects that need to scale beyond a single machine, Scrapy integrates with Scrapy Cloud and distributed task queues like Celery.

Step 8: Bypassing Anti-Bot Protections Ethically

As web scraping Python projects grow in scope, you will inevitably encounter anti-bot protections. Understanding these defenses and knowing how to handle them ethically is essential for building reliable scrapers. The goal is not to circumvent security measures for malicious purposes, but to collect publicly available data in a way that does not harm the target website.

The most common anti-bot measures in 2026 include IP-based rate limiting, User-Agent analysis, browser fingerprinting, JavaScript challenges, and CAPTCHAs. Here are ethical strategies to handle each:

User-Agent Rotation: Sending the same User-Agent string with every request is a clear bot signal. The fake-useragent library provides a pool of real browser User-Agent strings that you can rotate between requests. Always use recent, valid User-Agent strings that match current browser versions.

Request Headers: Real browsers send a consistent set of headers including Accept, Accept-Language, Accept-Encoding, and Referer. Missing or inconsistent headers are a common reason scrapers get blocked. Copy the headers from your browser's developer tools Network tab and replicate them in your requests.

Request Timing: Humans do not click links at exact 1-second intervals. Add randomized delays between requests using time.sleep(random.uniform(1.0, 3.0)). This makes your request pattern look more natural and reduces server load. Scrapy's AUTOTHROTTLE feature handles this automatically.

Session Management: Maintain cookies across requests using a requests.Session() object. Many websites set tracking cookies on the first visit and expect them on subsequent requests. A session also reuses TCP connections, which improves performance and reduces your footprint on the server.

Proxy Rotation: If you need to make a high volume of requests, distributing them across multiple IP addresses via proxy rotation prevents any single IP from being rate-limited. Services like BrightData and Oxylabs offer residential proxy pools specifically designed for web scraping. For smaller projects, free proxy lists are available but tend to be unreliable and slow.
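
The randomized request timing described above can be wrapped in a small helper. This is a minimal sketch: the name polite_delay is hypothetical, and the sleep function is injectable so the demonstration records delays instead of actually waiting.

```python
import random
import time
from typing import Callable


def polite_delay(
    min_seconds: float = 1.0,
    max_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Demonstration: record the delays instead of sleeping
recorded: list[float] = []
delays = [polite_delay(sleep=recorded.append) for _ in range(5)]
print([round(d, 2) for d in delays])
```

In a scraper loop, calling polite_delay() with its default sleep between requests spreads them over irregular intervals.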

Anti-Bot Measure | Detection Method | Ethical Counter-Strategy
IP Rate Limiting | Too many requests from one IP | Add delays, use proxy rotation
User-Agent Check | Missing or bot-like UA string | Rotate real browser User-Agents
Browser Fingerprinting | Missing JS APIs, WebGL, fonts | Use Playwright with real browser
JavaScript Challenge | Client must execute JS code | Use headless browser (Playwright)
CAPTCHA | Requires human verification | Slow down requests, use sessions
Honeypot Links | Hidden links only bots follow | Check element visibility before clicking

Remember that ethical web scraping means only collecting publicly available data, respecting rate limits, honoring robots.txt, and never attempting to access authenticated or private content without authorization. If a website explicitly blocks scraping in its Terms of Service and deploys technical measures to enforce it, consider whether the data is available through an official API instead.

Step 9: Error Handling and Retry Logic

Production web scrapers must handle failures gracefully. Network timeouts, HTTP errors, changed page structures, and rate limiting are all inevitable. A scraper without proper error handling will crash at the worst possible time, losing hours of progress. This step shows you how to build resilient error handling into your web scraping Python code.

The most important pattern is exponential backoff with retries. When a request fails, wait before retrying, and increase the wait time with each subsequent failure. This prevents your scraper from hammering a server that is already struggling:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def create_session_with_retries(
    max_retries: int = 3,
    backoff_factor: float = 1.0,
    status_forcelist: tuple = (429, 500, 502, 503, 504),
) -> requests.Session:
    """Create a requests Session with automatic retry logic."""
    session = requests.Session()

    retry_strategy = Retry(
        total=max_retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
        allowed_methods=["GET", "HEAD"],
        raise_on_status=False,
    )

    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    return session


def safe_request(session: requests.Session, url: str) -> requests.Response | None:
    """Make a request with comprehensive error handling."""
    try:
        response = session.get(url, timeout=(5, 30))

        if response.status_code == 200:
            return response
        elif response.status_code == 403:
            logger.warning(f"Access forbidden: {url} - check headers/IP")
        elif response.status_code == 404:
            logger.warning(f"Page not found: {url}")
        elif response.status_code == 429:
            retry_after = response.headers.get("Retry-After", "60")
            logger.warning(f"Rate limited. Retry after {retry_after}s")
        else:
            logger.warning(f"HTTP {response.status_code} for {url}")

        return None

    except requests.exceptions.ConnectTimeout:
        logger.error(f"Connection timeout: {url}")
    except requests.exceptions.ReadTimeout:
        logger.error(f"Read timeout: {url}")
    except requests.exceptions.ConnectionError:
        logger.error(f"Connection failed: {url}")
    except requests.exceptions.TooManyRedirects:
        logger.error(f"Too many redirects: {url}")

    return None


# Usage
session = create_session_with_retries(max_retries=3, backoff_factor=1.5)
response = safe_request(session, "https://books.toscrape.com")
if response:
    print(f"Success: {len(response.text)} bytes received")

The create_session_with_retries function configures urllib3's built-in retry mechanism. urllib3 sleeps roughly backoff_factor * 2^(n-1) seconds before the nth consecutive retry and, in the releases we are aware of, skips the delay before the first retry, so with backoff_factor=1.0 the three retries wait about 0, 2, and 4 seconds respectively. The status_forcelist specifies which HTTP status codes trigger automatic retries: 429 (rate limited), 500 (server error), 502 (bad gateway), 503 (service unavailable), and 504 (gateway timeout). The safe_request wrapper adds granular logging for each failure type, making debugging much easier in production.
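
Independently of urllib3, the exponential backoff idea can be written by hand. The sketch below is a hypothetical helper, not urllib3's implementation: it waits base * 2^attempt seconds after each failure (so its schedule may differ from urllib3's), and the sleep function is injectable so the demonstration does not actually pause.

```python
import random
import time
from typing import Callable


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delay before each retry: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def retry_with_backoff(
    func: Callable[[], object],
    retries: int = 3,
    base: float = 1.0,
    jitter: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call func until it succeeds or the retries are exhausted."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries, surface the last error
            # Optional jitter avoids many clients retrying in lockstep
            sleep(delays[attempt] + random.uniform(0, jitter))


# Demonstration: a function that fails twice, then succeeds
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


slept: list[float] = []
result = retry_with_backoff(flaky, retries=3, base=1.0, sleep=slept.append)
print(result, slept)
```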

Beyond HTTP errors, you also need to handle parsing errors. Wrap your BeautifulSoup extraction code in try/except blocks to catch AttributeError (when a selector returns None) and TypeError (when the page structure changes). Log the URL and the selector that failed so you can quickly identify which pages need attention. For long-running scrapers, implement checkpointing by saving your progress periodically. Store the list of already-scraped URLs in a file or database so you can resume from where you left off if the scraper crashes.
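
Checkpointing can be as simple as a JSON file of finished URLs. In this minimal sketch the file name, the helper names, and the example URLs are all hypothetical, and a temporary directory is used so the demonstration leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_checkpoint(path: Path) -> set[str]:
    """Return the set of URLs already scraped, or an empty set on first run."""
    if path.exists():
        return set(json.loads(path.read_text(encoding="utf-8")))
    return set()


def save_checkpoint(path: Path, done: set[str]) -> None:
    """Write to a temporary file first so a crash cannot leave a half-written checkpoint."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(sorted(done)), encoding="utf-8")
    tmp.replace(path)


def run_with_checkpoint(urls: list[str], path: Path, scrape_one: Callable[[str], None]) -> set[str]:
    """Scrape only the URLs that are not recorded in the checkpoint yet."""
    done = load_checkpoint(path)
    for url in urls:
        if url in done:
            continue
        scrape_one(url)
        done.add(url)
        save_checkpoint(path, done)
    return done


URLS = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "scraped_urls.json"

    first_pass: list[str] = []
    run_with_checkpoint(URLS[:2], checkpoint, first_pass.append)  # simulates a run that stopped early

    second_pass: list[str] = []
    run_with_checkpoint(URLS, checkpoint, second_pass.append)     # resumes where the first run ended

print(first_pass, second_pass)
```

Saving after every URL is the simplest policy; for large crawls, saving every N items trades a little repeated work after a crash for fewer disk writes.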

Step 10: Async Scraping with httpx for Maximum Performance

When you need to scrape hundreds or thousands of static pages, synchronous requests become a bottleneck. Each request blocks until it receives a response, leaving your scraper idle during network round-trips. Asynchronous scraping with httpx and Python's asyncio allows you to send multiple requests concurrently, dramatically reducing total scrape time. This is one of the most impactful optimizations for any web scraping Python project.

import httpx
import asyncio
from bs4 import BeautifulSoup
from dataclasses import dataclass, asdict
from pathlib import Path
import json
import time


@dataclass
class ScrapedItem:
    url: str
    title: str
    price: str
    rating: str


async def fetch_page(
    client: httpx.AsyncClient, url: str, semaphore: asyncio.Semaphore
) -> str | None:
    """Fetch a single page with concurrency control."""
    async with semaphore:
        try:
            response = await client.get(url, timeout=15.0)
            response.raise_for_status()
            await asyncio.sleep(0.5)  # Rate limiting: hold the slot a little longer
            return response.text
        except httpx.HTTPError as e:
            print(f"Error fetching {url}: {e}")
            return None


def parse_page(html: str, url: str) -> list[ScrapedItem]:
    """Parse a single page and extract items."""
    soup = BeautifulSoup(html, "lxml")
    items = []
    for book in soup.select("article.product_pod"):
        items.append(ScrapedItem(
            url=url,
            title=book.select_one("h3 a")["title"],
            price=book.select_one(".price_color").get_text(strip=True),
            rating=book.select_one("p.star-rating")["class"][1],
        ))
    return items


async def scrape_all(base_url: str, total_pages: int = 50) -> list[dict]:
    """Scrape multiple pages concurrently."""
    urls = [f"{base_url}/catalogue/page-{i}.html" for i in range(1, total_pages + 1)]
    semaphore = asyncio.Semaphore(5)  # Max 5 concurrent requests
    all_items = []

    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0 (compatible; TechInsiderBot/1.0)"},
        follow_redirects=True,
    ) as client:
        tasks = [fetch_page(client, url, semaphore) for url in urls]
        pages = await asyncio.gather(*tasks)  # Results come back in URL order

    for url, html in zip(urls, pages):
        if html:
            items = parse_page(html, url)
            all_items.extend(items)

    return [asdict(item) for item in all_items]


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(scrape_all("https://books.toscrape.com", total_pages=50))
    elapsed = time.perf_counter() - start
    print(f"Scraped {len(results)} items in {elapsed:.2f} seconds")

    Path("output").mkdir(exist_ok=True)  # Make sure the output folder exists
    with open("output/async_results.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)

The key to this approach is the asyncio.Semaphore, which limits concurrent requests to 5. Without this throttle, you could inadvertently send hundreds of simultaneous requests, overwhelming the target server and getting your IP banned. The asyncio.gather() call dispatches all requests concurrently within the semaphore's limit, and the 0.5-second sleep inside fetch_page adds a per-request delay for additional politeness.
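To see the throttle at work without touching a real server, the sketch below swaps the network call for asyncio.sleep and records how many tasks are inside the semaphore at once. The names fake_fetch and run_demo are illustrative only and are not part of the scraper above.

```python
import asyncio


async def fake_fetch(i: int, semaphore: asyncio.Semaphore, stats: dict) -> int:
    """Stand-in for a network call: hold a semaphore slot while 'downloading'."""
    async with semaphore:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated network latency
        stats["in_flight"] -= 1
        return i


async def run_demo(total: int = 20, limit: int = 5) -> tuple[list[int], int]:
    """Launch `total` tasks at once and report the peak concurrency observed."""
    semaphore = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0}
    # gather() schedules every task immediately; the semaphore decides
    # how many of them may be past the `async with` line at any moment.
    results = await asyncio.gather(
        *(fake_fetch(i, semaphore, stats) for i in range(total))
    )
    return list(results), stats["peak"]
```

Running asyncio.run(run_demo()) should report a peak equal to the limit, and the results keep the order in which the tasks were passed to gather(), which is why the scraper can safely zip URLs with pages.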

For I/O-bound tasks, async scraping with httpx is typically 5-10x faster than synchronous requests, with the speedup bounded by the concurrency limit you choose. For example, a 50-page scrape that takes 75 seconds synchronously (about 1.5 seconds per page) can finish in roughly 15 seconds with 5 concurrent connections. The trade-off is increased complexity and the need to carefully manage concurrency to avoid overwhelming targets. For most web scraping Python projects with more than 100 pages, the performance gain is well worth the additional code.

Step 11: Scheduling and Automating Your Scraper

A scraper that runs once is a script. A scraper that runs on a schedule is a data pipeline. Many use cases, from price monitoring to news aggregation, require regular data collection. Python offers several ways to schedule and automate your web scraper, from simple cron jobs to full orchestration platforms.

The simplest approach is a cron job on Linux or macOS. Add an entry to your crontab with crontab -e:

# Run scraper every day at 06:00 (cron uses the system time zone, so this is UTC only if the server runs on UTC)
0 6 * * * cd /home/user/python-web-scraper && /home/user/python-web-scraper/venv/bin/python scraper_basic.py >> /var/log/scraper.log 2>&1

# Run every 6 hours
0 */6 * * * cd /home/user/python-web-scraper && /home/user/python-web-scraper/venv/bin/python scraper_basic.py >> /var/log/scraper.log 2>&1

For more sophisticated scheduling within Python, the schedule library provides a human-readable API. For production deployments, consider using Celery with Redis as a message broker for distributed task execution, or Apache Airflow for complex workflows with dependencies. Docker containerization is also recommended for deployment: package your scraper, its dependencies, and a cron scheduler into a Docker image for consistent execution across environments.

Always implement monitoring for scheduled scrapers. Log the number of items scraped, any errors encountered, and the total execution time for each run. Set up alerts (via email, Slack, or PagerDuty) for runs that fail or return significantly fewer items than expected. This early warning system prevents silent failures where your scraper stops collecting data without anyone noticing for days or weeks.
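A minimal sketch of such a run check, using only the standard library, might look like the following. The RunSummary fields, the expected_items baseline, and the 50% threshold are assumptions chosen for illustration; the actual alert delivery (email, Slack, PagerDuty) is left to the caller.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("scraper.monitor")


@dataclass
class RunSummary:
    items_scraped: int
    errors: int
    elapsed_seconds: float


def needs_alert(summary: RunSummary, expected_items: int, min_ratio: float = 0.5) -> bool:
    """Flag runs that return nothing or far fewer items than expected."""
    if summary.items_scraped == 0:
        return True
    return summary.items_scraped < expected_items * min_ratio


def report(summary: RunSummary, expected_items: int) -> bool:
    """Log the run metrics and return True when someone should be notified."""
    logger.info(
        "run finished: items=%d errors=%d elapsed=%.1fs",
        summary.items_scraped, summary.errors, summary.elapsed_seconds,
    )
    alert = needs_alert(summary, expected_items)
    if alert:
        logger.warning(
            "item count %d is well below the expected %d",
            summary.items_scraped, expected_items,
        )
    return alert
```

A reasonable baseline for expected_items is the item count from the previous successful run, stored alongside your scraped data.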

Step 12: Complete Working Project Structure

Now let us bring everything together into a well-organized project structure that you can use as a template for any Python web scraping project. A clean project structure makes your code easier to maintain, test, and extend:

python-web-scraper/
├── venv/
├── config/
│   ├── __init__.py
│   └── settings.py          # Configuration (URLs, delays, export formats)
├── scrapers/
│   ├── __init__.py
│   ├── base.py              # Abstract base scraper class
│   ├── static_scraper.py    # Requests + BeautifulSoup scraper
│   ├── dynamic_scraper.py   # Playwright scraper
│   └── async_scraper.py     # httpx async scraper
├── exporters/
│   ├── __init__.py
│   └── data_exporter.py     # CSV, JSON, SQLite export
├── output/                  # Scraped data files
├── logs/                    # Log files
├── tests/
│   ├── __init__.py
│   ├── test_scraper.py
│   └── test_exporter.py
├── main.py                  # Entry point
├── requirements.txt
├── Dockerfile
└── README.md

The config/settings.py file centralizes all configuration. Instead of hardcoding URLs, delays, and export paths throughout your code, define them in one place. Use environment variables for sensitive values like proxy credentials. The scrapers/base.py defines an abstract base class that all scraper types inherit from, ensuring a consistent interface. Each scraper implementation (static, dynamic, async) lives in its own module and can be selected at runtime based on the target website's requirements.

This modular structure lets you add new scrapers for different websites without modifying existing code. When a website changes its layout, you only need to update the selectors in the relevant scraper module. The exporter is decoupled from the scraping logic, so you can change how data is stored without touching the scraping code. This separation of concerns is what distinguishes a disposable script from a maintainable production system.
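One possible shape for the abstract base class in scrapers/base.py is sketched below. The method names fetch, parse, and run are assumptions made for this example, and StubScraper exists only to show how a concrete scraper plugs in.

```python
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Common interface that the static, dynamic, and async scrapers implement."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    @abstractmethod
    def fetch(self, url: str) -> str:
        """Return the HTML for a URL."""

    @abstractmethod
    def parse(self, html: str) -> list[dict]:
        """Turn HTML into a list of item dictionaries."""

    def run(self, urls: list[str]) -> list[dict]:
        """Shared pipeline: fetch each URL, parse it, and collect the items."""
        items: list[dict] = []
        for url in urls:
            items.extend(self.parse(self.fetch(url)))
        return items


class StubScraper(BaseScraper):
    """Toy subclass that returns canned HTML instead of hitting the network."""

    def fetch(self, url: str) -> str:
        return f"<title>{url}</title>"

    def parse(self, html: str) -> list[dict]:
        return [{"raw": html}]
```

Because BaseScraper declares abstract methods, Python refuses to instantiate it directly, so a new scraper cannot forget to implement fetch or parse.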

Common Pitfalls and How to Avoid Them

Even experienced developers make these mistakes when building Python web scrapers. Here are the most common pitfalls and how to avoid them:

Pitfall 1: Not Setting a Timeout. The default requests.get() call has no timeout, meaning your scraper can hang indefinitely on a single request. Always set both connect and read timeouts: requests.get(url, timeout=(5, 30)). The first value is the connection timeout (5 seconds), and the second is the read timeout (30 seconds).

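Pitfall 2: Ignoring Character Encoding. Websites use different character encodings (UTF-8, ISO-8859-1, etc.). If your scraped text contains garbled characters, Requests guessed the encoding wrong. Check the charset declared in the Content-Type header or the page's meta tag and set response.encoding to match; response.apparent_encoding gives a detected guess when nothing is declared. If you know the page is UTF-8, set response.encoding = "utf-8", or use response.content.decode("utf-8", errors="replace") as a fallback that never raises.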

Pitfall 3: Scraping Without Checking robots.txt. Ignoring robots.txt is both unethical and risky. Websites may block your IP permanently or take legal action. Use Python's built-in urllib.robotparser to programmatically check if a URL is allowed before scraping it.
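Here is a small sketch of that check using urllib.robotparser. To keep it runnable offline, it parses a made-up robots.txt string; against a live site you would instead call set_url() with the site's robots.txt address followed by read(). The helper names and the sample rules are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Create a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True when the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

Build the parser once per domain and reuse it for every URL, so you do not download robots.txt repeatedly.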

Pitfall 4: Hardcoding Selectors Without Fallbacks. When a website updates its design, hardcoded selectors break instantly. Always wrap selectors in try/except blocks and provide fallback selectors. Log warnings when primary selectors fail so you know which pages need attention.
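A fallback chain can be as simple as trying selectors in order and logging when the primary one misses. The helper below is a sketch: select_with_fallback is a hypothetical name, and it works with any object that exposes a select_one(selector) method returning None on a miss, which is how BeautifulSoup tags behave.

```python
import logging
from typing import Any, Iterable, Optional

logger = logging.getLogger("scraper.selectors")


def select_with_fallback(node: Any, selectors: Iterable[str]) -> Optional[Any]:
    """Try each CSS selector in order and return the first match, else None."""
    candidates = list(selectors)
    for index, selector in enumerate(candidates):
        match = node.select_one(selector)
        if match is not None:
            if index > 0:
                # The primary selector missed, so the page layout probably changed.
                logger.warning(
                    "primary selector %r failed; used fallback %r",
                    candidates[0], selector,
                )
            return match
    logger.warning("no selector matched: %s", candidates)
    return None
```

With BeautifulSoup you might call select_with_fallback(book, [".price_color", ".price"]) and then check the result for None before reading its text.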

Pitfall 5: Not Handling Relative URLs. Many websites use relative URLs in their links and images. If you extract a link like /products/123, you need to join it with the base URL to get the full path. Use urllib.parse.urljoin(base_url, relative_url) to handle this correctly.
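The snippet below shows how urljoin resolves the common kinds of href against the URL of the page they were found on. The example.com addresses and the absolutize helper are placeholders for illustration.

```python
from urllib.parse import urljoin


def absolutize(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against the URL of the page it was scraped from."""
    return [urljoin(page_url, href) for href in hrefs]


page = "https://example.com/catalogue/page-1.html"
links = absolutize(page, [
    "/products/123",                      # root-relative
    "page-2.html",                        # relative to the current directory
    "../media/cover.jpg",                 # parent directory
    "https://cdn.example.org/logo.png",   # already absolute, left unchanged
])
# links == [
#     "https://example.com/products/123",
#     "https://example.com/catalogue/page-2.html",
#     "https://example.com/media/cover.jpg",
#     "https://cdn.example.org/logo.png",
# ]
```

Note that the base should be the URL of the page you actually fetched (after redirects), not just the site's home page, otherwise directory-relative links resolve incorrectly.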

Pitfall 6: Memory Leaks in Long-Running Scrapers. If you store all scraped data in a Python list, memory usage grows unbounded. For large scrapes, write data to disk incrementally (append to a CSV or insert into a database) rather than accumulating everything in memory.
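One way to write incrementally is to append each page's rows to a CSV file as soon as they are parsed. This sketch assumes the four fields used elsewhere in the tutorial (url, title, price, rating); append_rows is a hypothetical helper name.

```python
import csv
from pathlib import Path
from typing import Iterable

FIELDNAMES = ["url", "title", "price", "rating"]


def append_rows(path: Path, rows: Iterable[dict]) -> int:
    """Append rows to a CSV file, writing the header only when the file is new."""
    is_new = not path.exists() or path.stat().st_size == 0
    written = 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if is_new:
            writer.writeheader()
        for row in rows:
            writer.writerow(row)
            written += 1
    return written
```

Calling append_rows once per page keeps memory usage flat, because only one page's items are held at a time.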

Pitfall 7: Using Regular Expressions to Parse HTML. Regular expressions are fragile and error-prone for HTML parsing. A single unexpected attribute, whitespace change, or nested tag breaks the pattern. Always use a proper parser like BeautifulSoup or lxml instead of regex for extracting data from HTML.

Troubleshooting Guide

When your Python web scraper encounters issues, use this thorough troubleshooting guide to diagnose and fix the problem quickly:

Issue 1: "403 Forbidden" Response. The server is blocking your request. Add realistic headers including User-Agent, Accept, Accept-Language, and Referer. If that does not work, try using a requests.Session() to maintain cookies. As a last resort, switch to Playwright to render the page with a real browser engine.

Issue 2: Empty Results Despite Visible Page Content. The content is loaded dynamically via JavaScript. View the page source (Ctrl+U) rather than the inspector. If the data is not in the source, use Playwright or check the Network tab in developer tools for API calls that return the data as JSON.

Issue 3: "ConnectionError: Max retries exceeded." The server is rejecting your connection. You are likely sending requests too quickly. Increase your delay between requests to 3-5 seconds. Check if your IP is blocked by trying the URL in your browser. Consider rotating proxies if the problem persists.

Issue 4: "AttributeError: 'NoneType' object has no attribute 'text'." Your call to select_one() or find() returned None because the element was not found on the page. The HTML structure may have changed, or the page may have loaded differently. Print the raw HTML to a file and inspect it manually. Use if element is not None checks before accessing attributes.

Issue 5: Playwright Timeout Errors. Increase the timeout in page.goto() and wait_for_selector(). If specific elements never appear, the page may require interaction (clicking a cookie banner, closing a popup) before the content loads. Use page.screenshot() to capture what the page looks like at the time of the timeout.

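Issue 6: Duplicate Data in Results. Pagination overlaps, AJAX loading duplicates, or your scraper revisiting the same pages can cause duplicates. Use a set to track unique identifiers (URLs or item IDs) and skip items you have already seen. Scrapy's built-in duplicate filter already drops repeated requests by default; the DUPEFILTER_CLASS setting lets you swap in a custom filter, but it does not deduplicate scraped items, which still need a check in an item pipeline.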
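The set-based approach can be written as a small generator. This is a sketch; dedupe and its key argument are names chosen for this example, and you decide which field uniquely identifies an item.

```python
from typing import Callable, Hashable, Iterable, Iterator


def dedupe(items: Iterable[dict], key: Callable[[dict], Hashable]) -> Iterator[dict]:
    """Yield each item the first time its key is seen and skip later repeats."""
    seen: set[Hashable] = set()
    for item in items:
        identifier = key(item)
        if identifier in seen:
            continue
        seen.add(identifier)
        yield item
```

For example, list(dedupe(items, key=lambda item: item["url"])) keeps the first item for each URL and preserves the original order.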

Issue 7: Scrapy "Filtered offsite request" Warning. Your spider is trying to follow links to domains not listed in allowed_domains. Either add the domain to the list or remove the allowed_domains setting entirely if you want to follow external links.

Issue 8: SSL Certificate Errors. Some websites have misconfigured SSL certificates. While you can bypass this with verify=False in Requests, this is insecure. A better first step is to upgrade the certifi package, which supplies the CA bundle that Requests uses, with pip install --upgrade certifi. Only disable verification for local development or internal sites.

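Issue 9: Scraped Data Contains HTML Tags. This usually means you serialized the element itself (for example with str(element)) instead of extracting its text. Call .get_text(strip=True) in BeautifulSoup, which drops the tags and trims surrounding whitespace; plain .text also drops tags but keeps the whitespace. For Scrapy, select text nodes with the CSS ::text pseudo-element and use the TakeFirst() output processor to keep the first non-empty value.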

Advanced Tips and Best Practices

Take your web scraping Python skills to the next level with these advanced techniques used by professional data engineers and scraping specialists:

Intercept API Calls Instead of Parsing HTML. Many SPAs fetch data from internal APIs. Use your browser's Network tab (filter by XHR/Fetch) to find these endpoints. Often you can call the API directly with Requests, getting clean JSON without any HTML parsing. This approach is faster, more reliable, and less likely to break when the front-end changes.

Use Browser DevTools Protocol (CDP) for Stealth. Playwright can open a Chrome DevTools Protocol session on Chromium for advanced scenarios, and its add_init_script() method lets you modify the browser's JavaScript environment before page load, for example to override automation detection flags like navigator.webdriver. This gets past simple checks, but dedicated fingerprinting services examine many other signals, so do not expect it to make a headless browser indistinguishable from a real user's browser.

Implement Smart Caching. If you scrape the same site repeatedly, cache pages locally using a hash of the URL as the filename. Before making a network request, check if a recent cached version exists. This reduces load on the target server and speeds up development when you are iterating on your extraction logic. The requests-cache library adds transparent caching to any Requests-based scraper with a single line of code.
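The hand-rolled version of this cache fits in a few lines. In the sketch below, get_cached, cache_path, and the one-hour max_age_seconds default are assumptions for illustration; fetch is any function you supply that downloads a URL and returns its HTML.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable filename by hashing it."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def get_cached(
    url: str,
    fetch: Callable[[str], str],
    cache_dir: Path,
    max_age_seconds: float = 3600.0,
) -> str:
    """Return a fresh cached copy if one exists, otherwise fetch and store it."""
    path = cache_path(cache_dir, url)
    if path.exists() and time.time() - path.stat().st_mtime < max_age_seconds:
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

Hashing the URL avoids filesystem problems with slashes and query strings, at the cost of filenames that are not human-readable.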

Use Selectolax for Maximum Parsing Speed. When processing millions of HTML documents, BeautifulSoup can become a bottleneck. The selectolax library, built on the Modest and Lexbor C engines, parses HTML up to 30x faster than BeautifulSoup. Its CSS-selector methods, css() and css_first(), are close enough to BeautifulSoup's select() and select_one() that migration is usually straightforward, making it a strong choice for high-volume web scraping in Python.

Use Structured Data. Many websites embed structured data using JSON-LD, Schema.org microdata, or Open Graph tags. For JSON-LD, locate the block with soup.find("script", type="application/ld+json") and parse its contents with json.loads(). The resulting JSON contains clean, structured data about products, articles, events, and more, often including information not visible on the page itself.
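To illustrate JSON-LD extraction without extra dependencies, the sketch below uses the standard library's html.parser in place of BeautifulSoup. That substitution is made only so the example is self-contained; the class and function names are hypothetical, and the sample HTML is invented.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.documents: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than abort the scrape


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD document embedded in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents
```

A page can contain several JSON-LD blocks, so the function returns a list; filter it by the "@type" key to find the product or article entry you need.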

Python Web Scraping Library Comparison (2026)

Choosing the right library for your web scraping Python project depends on your specific requirements. Here is a detailed comparison of the major libraries available in 2026:

| Library | Best For | Speed | JS Support | GitHub Stars | Learning Curve |
|---|---|---|---|---|---|
| Requests + BS4 | Static sites, beginners | Very Fast | No | 52k + 49k | Easy |
| Scrapy | Large-scale crawling | Fast | Via Splash | 54.8k | Moderate |
| Playwright | Dynamic JS sites | Moderate | Full | 71.5k | Moderate |
| Selenium | Legacy browser automation | Slow | Full | 32k | Moderate |
| httpx | Async static scraping | Very Fast | No | 13.5k | Moderate |
| Selectolax | High-volume parsing | Fastest | No | 1.2k | Easy |

For most developers starting out with web scraping in Python, the Requests plus BeautifulSoup combination is the right choice. It handles the majority of websites, has extensive documentation, and is the most widely taught approach. Graduate to Scrapy when your project grows beyond a single script, and add Playwright when you encounter JavaScript-rendered content. The httpx library is ideal when you need to maximize throughput on static sites, and Selectolax is a drop-in performance upgrade for high-volume parsing workloads.

Frequently Asked Questions

Is web scraping with Python legal?

Scraping publicly available data is generally treated as lawful in the United States: in hiQ Labs v. LinkedIn, the Ninth Circuit held that accessing public pages likely does not violate the Computer Fraud and Abuse Act. That ruling does not rule out other claims, such as breach of contract under a site's Terms of Service. Scraping copyrighted content, personal data protected by GDPR or CCPA, or data behind authentication without permission may also violate laws. Always check the website's Terms of Service and robots.txt file. When in doubt, consult a legal professional familiar with data privacy laws in your jurisdiction.

Which Python library is best for web scraping in 2026?

For static websites, the combination of Requests and BeautifulSoup remains the best choice for most developers. For JavaScript-heavy sites, Playwright is the leading tool with over 71,000 GitHub stars and excellent Python support. For large-scale crawling projects, Scrapy provides a complete framework with built-in concurrency, middleware, and export capabilities. The best library depends on your specific use case and the type of website you need to scrape.

How do I scrape a website that uses JavaScript?

Use a headless browser like Playwright or Selenium. Playwright is recommended in 2026 for its speed, cross-browser support, and modern async API. Install it with pip install playwright && playwright install, then use page.goto() to load the page and page.wait_for_selector() to wait for dynamic content. Alternatively, check the browser's Network tab for API endpoints that return data as JSON, which can be called directly with Requests.

How can I avoid getting blocked while scraping?

Rotate User-Agent strings, add realistic headers, implement delays between requests (1-3 seconds), maintain sessions with cookies, and respect robots.txt. For high-volume scraping, use proxy rotation services. The most important practice is being respectful: do not send more requests than necessary, scrape during off-peak hours when possible, and cache pages you have already downloaded.

Can I scrape data behind a login page?

Technically yes, using requests.Session() to maintain login cookies or Playwright to automate the login form. However, this is ethically and legally questionable unless you are scraping your own account data. Most websites' Terms of Service prohibit automated access to authenticated content. Consider using the website's official API if one is available, or contact the website owner to request data access.

How fast can Python web scrapers run?

With synchronous Requests, expect about 1-2 pages per second with polite delays. Async scraping with httpx can achieve 10-20 pages per second with controlled concurrency. Scrapy, with its built-in async engine, handles similar throughput. Playwright is slower at 0.5-1 pages per second due to browser rendering overhead. For maximum speed on static sites, combine httpx with Selectolax for parsing, which can process hundreds of pages per second in raw throughput.

What is the difference between Scrapy and BeautifulSoup?

BeautifulSoup is a parsing library that extracts data from HTML documents. Scrapy is a complete web scraping framework that includes HTTP client, request scheduling, concurrency management, data pipelines, and export functionality. BeautifulSoup requires you to build the infrastructure yourself (HTTP requests, pagination, error handling), while Scrapy provides all of this out of the box. Use BeautifulSoup for simple scripts and Scrapy for production crawling systems.

How do I store large amounts of scraped data?

For datasets under 1 million rows, SQLite is ideal: zero configuration, single file, and supports SQL queries. For larger datasets or concurrent access, use PostgreSQL with the psycopg2 library. For semi-structured or deeply nested data, MongoDB works well. Always write data incrementally (row by row or in batches) rather than accumulating everything in memory. Use pandas for post-processing and analysis once the data is stored.
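As a sketch of batched, incremental writes with SQLite, the example below uses only the standard library's sqlite3 module. The books table, its columns, and the uniqueness rule on (url, title) are assumptions that mirror the fields scraped earlier in this tutorial.

```python
import sqlite3
from typing import Iterable


def init_db(conn: sqlite3.Connection) -> None:
    """Create the table on first use; the UNIQUE constraint blocks duplicates."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS books (
            url    TEXT NOT NULL,
            title  TEXT NOT NULL,
            price  TEXT,
            rating TEXT,
            UNIQUE (url, title)
        )
        """
    )


def insert_batch(conn: sqlite3.Connection, rows: Iterable[tuple[str, str, str, str]]) -> None:
    """Insert one batch in a single transaction, ignoring rows already stored."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR IGNORE INTO books (url, title, price, rating) VALUES (?, ?, ?, ?)",
            rows,
        )


def count_rows(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
```

Calling insert_batch once per scraped page keeps memory usage flat and means a crash loses at most the page in progress.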

Last updated: March 29, 2026

Marcus Chen

Senior Tech Reporter

Marcus Chen is a Senior Tech Reporter at Tech Insider covering cloud computing, enterprise software, and the business of technology. Before joining TI, he spent five years at ZDNet covering digital transformation across European enterprises and three years at The Register reporting on cloud infrastructure. Marcus is known for his deep dives into cloud cost optimization and multi-cloud strategy. He holds a degree in Computer Science from Imperial College London and speaks regularly at KubeCon and CloudNative events.
