Implementing Web Scraping in Python with Scrapy

Last Updated : 15 Jun, 2026

Web data can be collected through APIs or by scraping. BeautifulSoup works for small tasks, but it only parses HTML, so fetching and crawling at scale are left to you and tend to be slow. Scrapy is a Python framework built for larger crawls: it sends requests asynchronously, crawls pages concurrently, and includes built-in data handling, which makes it suitable for collecting very large numbers of records.

Installation

Scrapy requires Python to be installed on the system. After installing Python, follow the steps below to install Scrapy.

1. Create a Virtual Environment (optional but recommended):

python -m venv scrapyenv

2. Activate the Virtual Environment:

On Windows:

scrapyenv\Scripts\activate

On macOS/Linux:

source scrapyenv/bin/activate

3. Install Scrapy:

python -m pip install scrapy

Web scraping

Web scraping means collecting data from websites, and Scrapy makes it easy by letting you build "spiders": small programs that do the browsing and data collection for you. Here’s how it works:

Start a project: Keeps your code and settings organized.
Create a spider: Tell Scrapy what sites to visit and what data to collect.
Parse the data: Extract the info you need, like titles or prices.

Scrapy handles the heavy lifting, like sending requests and following links, so you can focus on grabbing the data you care about.

1. Start a Scrapy project

To begin scraping with Scrapy, the first step is to create a well-structured project. Scrapy simplifies this process by automatically generating a complete directory layout for your project. To create a new project, run the following command:

scrapy startproject gfg

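This creates a folder named gfg/ with the following structure:

gfg/
    scrapy.cfg            (deploy configuration)
    gfg/
        __init__.py
        items.py          (item definitions)
        middlewares.py    (spider and downloader middlewares)
        pipelines.py      (item pipelines)
        settings.py       (project settings)
        spiders/
            __init__.py   (your spiders go in this folder)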

2. Create your first spider

In Scrapy, spiders are Python classes that define how to follow links and extract data from websites. Now that your project is set up, it’s time to create your first spider.

1. Navigate to the spiders directory:

cd gfg/gfg/spiders

2. Create a new Python file for your spider. For example, you can name it gfgfetch.py:

gfgfetch.py

3. Define your spider: Open gfgfetch.py and add the following code to create a simple spider:

Explanation:

The ExtractUrls spider starts by crawling https://www.geeksforgeeks.org/ and is restricted to the geeksforgeeks.org domain.
The spider extracts the page title and all anchor (<a>) tag URLs using CSS selectors.
For each URL, it yields a dictionary containing the page title and the URL.
If the URL is valid, it recursively follows the link to scrape more pages.
This process continues, scraping URLs and titles from pages within the same domain.

Testing with the Scrapy shell

Before writing the parse function, it's helpful to test selectors in the Scrapy shell, an interactive environment for trying out scraping commands:

scrapy shell https://www.geeksforgeeks.org/

Use CSS selectors to fetch data, e.g., to get all anchor tags with href:

response.css('a::attr(href)').getall()

Run the spider

To run and save the results:

scrapy crawl extract -o links.json

This creates a JSON file (links.json) with titles and links.

Output

[
  {
    "page_title": "GeeksforGeeks | A computer science portal for geeks",
    "link": "https://www.geeksforgeeks.org/data-structures/"
  },
  {
    "page_title": "GeeksforGeeks | A computer science portal for geeks",
    "link": "https://www.geeksforgeeks.org/fundamentals-of-algorithms/"
  },
  {
    "page_title": "GeeksforGeeks | A computer science portal for geeks",
    "link": "/about/"
  }
]

The spider extracts the page title and all hyperlinks found on the page. Each result is stored as a JSON object containing page_title and link.
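As the third record shows, hrefs are exported exactly as they appear in the page, so some of them are relative. One way to post-process the export is sketched below, using only the standard library. It assumes the links should be resolved against the start page, since the export does not record which page each link came from; the function name absolutize_links is hypothetical.

```python
import json
from urllib.parse import urljoin

# Assumed base: the page the links were collected from
BASE_URL = "https://www.geeksforgeeks.org/"


def absolutize_links(records, base_url=BASE_URL):
    """Return copies of the records with each "link" resolved against base_url."""
    return [{**rec, "link": urljoin(base_url, rec["link"])} for rec in records]


# A two-record excerpt in the same shape as links.json
sample = json.loads("""
[
  {"page_title": "GeeksforGeeks | A computer science portal for geeks",
   "link": "https://www.geeksforgeeks.org/data-structures/"},
  {"page_title": "GeeksforGeeks | A computer science portal for geeks",
   "link": "/about/"}
]
""")

for rec in absolutize_links(sample):
    print(rec["link"])
```

To process the real export, the sample could be replaced with the result of json.load on an open links.json file.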

Note: Scraping any website without permission may violate terms of service. Always check a site’s robots.txt file and get proper authorization before scraping.
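Scrapy has settings that help a crawl stay polite. The fragment below is a sketch of entries that could go in the project's settings.py. The setting names are Scrapy's own, but the delay value and the contact address in the user agent are placeholder assumptions, not recommendations from the article.

```python
# gfg/gfg/settings.py (excerpt)

# Respect the rules in each site's robots.txt.
ROBOTSTXT_OBEY = True

# Wait between requests to the same site (seconds; placeholder value).
DOWNLOAD_DELAY = 1.0

# Let Scrapy adapt the delay to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True

# Identify the crawler; the contact address is a placeholder.
USER_AGENT = "gfg-tutorial-bot (+mailto:you@example.com)"
```

These settings reduce load on the target site but do not replace permission from the site owner.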

Selector reference

Scrapy supports CSS selectors (as well as XPath expressions) for extracting data from HTML pages. Here's a quick guide to commonly used CSS selectors when scraping links and content from web pages:

Purpose	Code	Example Output
Select all <a> tags	response.css('a')	<a href="...">...</a>
Extract full tag HTML	response.css('a').extract()	['<a href="...">Text</a>', ...]
Get all href links	response.css('a::attr(href)').getall()	['https://www.geeksforgeeks.org/', ...]

Example of extracted data:

<a href="https://www.geeksforgeeks.org/" title="GeeksforGeeks" rel="home">GeeksforGeeks</a>


URL: https://www.geeksforgeeks.org/python/implementing-web-scraping-python-scrapy/