Web data can be collected through APIs or scraping. BeautifulSoup works well for small tasks, but it is slow for large-scale crawls. Scrapy is a faster Python framework with asynchronous requests, parallel crawling, and built-in data handling, which makes it well suited to processing millions of records efficiently.
Scrapy requires Python to be installed on the system. After installing Python, follow the steps below to install Scrapy.
1. Create a Virtual Environment (optional but recommended):
python -m venv scrapyenv
2. Activate the Virtual Environment:
On Windows:
scrapyenv\Scripts\activate
On macOS or Linux:
source scrapyenv/bin/activate
3. Install Scrapy:
python -m pip install scrapy
Web scraping means collecting data from websites. Scrapy makes this easy by letting you build "spiders": small programs that browse pages and collect data for you.
Scrapy handles the heavy lifting, like sending requests and following links, so you can focus on extracting the data you care about.
To begin scraping with Scrapy, the first step is to create a well-structured project. Scrapy simplifies this process by automatically generating a complete directory layout for your project. To create a new project, run the following command:
scrapy startproject gfg
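This creates a folder named gfg/ containing a scrapy.cfg file and an inner gfg/ package with items.py, middlewares.py, pipelines.py, settings.py, and a spiders/ directory.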
In Scrapy, spiders are Python classes that define how to follow links and extract data from websites. Now that your project is set up, it’s time to create your first spider.
1. Navigate to the spiders directory:
cd gfg/gfg/spiders
2. Create a new Python file for your spider. For example, you can name it gfgfetch.py:
gfgfetch.py
3. Define your spider: Open gfgfetch.py and add the following code to create a simple spider:
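Explanation: A spider is a class that subclasses scrapy.Spider. Its name identifies it on the command line (scrapy crawl uses this name), its start URLs tell Scrapy where to begin, and its parse function receives each downloaded response and yields the extracted data.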
Before writing the parse function, it's helpful to test selectors using the Scrapy shell, an interactive environment for trying out scraping commands:
scrapy shell https://www.geeksforgeeks.org/
Use CSS selectors to fetch data. For example, to get the href value of every anchor tag:
response.css('a::attr(href)').getall()
To run the spider and save the results, run the following from inside the project directory:
scrapy crawl extract -o links.json
This creates a JSON file (links.json) with titles and links.
Output
[
{
"page_title": "GeeksforGeeks | A computer science portal for geeks",
"link": "https://www.geeksforgeeks.org/data-structures/"
},
{
"page_title": "GeeksforGeeks | A computer science portal for geeks",
"link": "https://www.geeksforgeeks.org/fundamentals-of-algorithms/"
},
{
"page_title": "GeeksforGeeks | A computer science portal for geeks",
"link": "/about/"
}
]
The spider extracts the page title and all hyperlinks found on the page. Each result is stored as a JSON object containing page_title and link.
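The sample output mixes absolute URLs with a relative one (/about/). One way to normalize them after the crawl, sketched here with Python's standard urllib.parse.urljoin, is to resolve each link against the page it came from. The base URL and the helper name absolutize are assumptions for illustration; inside a spider, response.urljoin(link) does the same job.

```python
from urllib.parse import urljoin

# Assumed: the page the links were scraped from
BASE_URL = "https://www.geeksforgeeks.org/"


def absolutize(records, base_url=BASE_URL):
    """Return copies of the records with every link made absolute."""
    return [{**rec, "link": urljoin(base_url, rec["link"])} for rec in records]


# Two records in the same shape as links.json
records = [
    {
        "page_title": "GeeksforGeeks | A computer science portal for geeks",
        "link": "https://www.geeksforgeeks.org/data-structures/",
    },
    {
        "page_title": "GeeksforGeeks | A computer science portal for geeks",
        "link": "/about/",
    },
]

for rec in absolutize(records):
    print(rec["link"])
# https://www.geeksforgeeks.org/data-structures/
# https://www.geeksforgeeks.org/about/
```

Absolute links pass through unchanged, so the helper is safe to apply to every record.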
Note: Scraping any website without permission may violate terms of service. Always check a site’s robots.txt file and get proper authorization before scraping.
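As a rough illustration of the robots.txt check, Python's standard urllib.robotparser can evaluate rules before any request is sent. The rules, the example.com URLs, and the user agent name my-crawler below are made up for the example; for a real site, call set_url() and read() on the parser to load its actual robots.txt. Projects generated by scrapy startproject typically also enable the ROBOTSTXT_OBEY setting in settings.py, which asks Scrapy to perform this check itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, standing in for a site's real robots.txt
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)


def is_allowed(url, user_agent="my-crawler"):
    """Return True if the parsed rules permit this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/articles/"))  # True
print(is_allowed("https://example.com/private/x"))  # False
```

A permissive robots.txt does not replace the site's terms of service, so authorization still needs to be checked separately.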
Scrapy supports CSS selectors (as well as XPath) for extracting data from HTML pages. Here's a quick guide to commonly used CSS selectors when scraping links and content from web pages:
| Purpose | Code | Example Output |
|---|---|---|
| Select all <a> tags | response.css('a') | SelectorList of the matching <a> elements |
| Extract full tag HTML | response.css('a').extract() | ['<a href="...">Text</a>', ...] |
| Get all href links | response.css('a::attr(href)').getall() | ['https://www.geeksforgeeks.org/', ...] |
Example of extracted data:
<a href="https://www.geeksforgeeks.org/" title="GeeksforGeeks" rel="home">GeeksforGeeks</a>