URL: https://www.analyticsvidhya.com/blog/2021/06/web-scraping-with-python-beautifulsoup-library/

Web Scraping With Python: BeautifulSoup Library

Harika Last Updated : 27 Aug, 2021
5 min read

Intro:

By some industry estimates, roughly 80 percent of the world's data is unstructured: photographs, documents, audio and video recordings, and web content. To make use of the information it contains, we need to extract it and then find patterns or draw useful insights. But how do we get that unstructured data into a structured format? This is where web scraping comes into the picture.

What is Web Scraping:

In simple terms, web scraping (also called web harvesting or web data extraction) is an automated process for collecting large amounts of unstructured data from websites. The user can extract all the data on particular sites or only the specific data required. The collected data can then be stored in a structured format for further analysis.

Uses of Web Scraping:

In today's world, web scraping has gained a lot of attention and has a wide range of uses. A few of them are listed below:

  1. Social Media Sentiment Analysis
  2. Lead Generation in Marketing Domain
  3. Market Analysis, Online Price Comparison in eCommerce Domain
  4. Collecting training and test data for Machine Learning Applications

Steps involved in web scraping:

  1. Find the URL of the webpage that you want to scrape
  2. Select the particular elements by inspecting
  3. Write the code to get the content of the selected elements
  4. Store the data in the required format

It's that simple!

The popular libraries/tools used for web scraping are:

  • Selenium - a framework for testing web applications
  • BeautifulSoup - Python library for getting data out of HTML, XML, and other markup languages
  • Pandas - Python library for data manipulation and analysis

In this article, we will be building our own dataset by extracting Domino's Pizza reviews from the website consumeraffairs.com/food.

We will be using requests and BeautifulSoup for scraping and parsing the data.

Step 1: Find the URL of the webpage that you want to scrape

Open the URL "consumeraffairs.com/food", search for Domino's Pizza in the search bar, and hit Enter.

Below is what our reviews page looks like.

Step 1.1: Defining the Base URL, Query parameters

The base URL is the consistent part of the web address; here it is the path to the Domino's reviews page.

base_url = "https://www.consumeraffairs.com/food/dominos.html"

Query parameters are additional values appended to the URL. Here, the page parameter selects which page of reviews to load.

query_parameter = "?page="+str(i) # i represents the page number
URL = Base URL + Query Parameter (Image by Author)
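
As a minimal sketch of the "URL = Base URL + Query Parameter" idea, the page URLs can also be built with the standard library's urlencode instead of manual string concatenation. The helper name build_page_url is hypothetical and does not come from the original code.

```python
from urllib.parse import urlencode

# Base URL from the article; the helper name below is an illustrative assumption.
base_url = "https://www.consumeraffairs.com/food/dominos.html"


def build_page_url(base, page):
    """Return the URL of one results page, e.g. <base>?page=3."""
    return base + "?" + urlencode({"page": page})


# URLs for the first five pages of reviews
page_urls = [build_page_url(base_url, i) for i in range(1, 6)]
for url in page_urls:
    print(url)
```

Using urlencode also takes care of escaping if more parameters are added later.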

Step 2: Select the particular elements by inspecting

Below is an image of a sample review. Each review has many elements: the rating given by the user, username, review date, and the review text along with some information about how many people liked it.

Our interest is to extract only the review text. For that, we need to inspect the page and find the HTML tag and attribute names of the target element.

To inspect a web page, right-click on the page, select Inspect, or use the keyboard shortcut Ctrl+Shift+I.

In our case, the review text is stored in the HTML <p> tag inside the div with the class name "rvw-bd".

Inspecting the target elements
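
To make the selection concrete, here is a small sketch that parses a hand-written HTML fragment instead of the live page. The markup is a simplified assumption modeled on the description above (a div with class "rvw-bd" containing a p tag); the real page is more complex and may have changed since this article was written.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; only the class name "rvw-bd" comes from the article.
sample_html = """
<div class="rvw-bd"><p>The pizza arrived cold.</p></div>
<div class="rvw-bd"><p>Great service and fast delivery.</p></div>
<div class="other"><p>This is not a review.</p></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Find every div whose class is "rvw-bd", then read the first <p> inside each one
review_divs = soup.find_all("div", class_="rvw-bd")
review_texts = [div.find("p").get_text(strip=True) for div in review_divs]

print(review_texts)
```

The third div is skipped because its class does not match, which is exactly why we look up the class name while inspecting the page.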

With this, we are familiar with the webpage. Let's quickly jump into the scraping.

Step 3: Write the code to get the content of the selected elements

Begin by installing the necessary modules/packages

pip install pandas requests BeautifulSoup4

Import necessary libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

pandas - to create a dataframe
requests - to send HTTP requests and access the HTML content from the target webpage
BeautifulSoup - a Python library for parsing HTML and XML documents

Create an empty list to store all the scraped reviews

all_pages_reviews = []

Define a scraper function

def scraper():

Inside the scraper function, write a for loop over the pages you would like to scrape. Here we scrape the reviews from five pages.

    for i in range(1, 6):

Create an empty list to store the reviews of each page (from 1 to 5)

        pagewise_reviews = []

Construct the URL

        query_parameter = "?page=" + str(i)
        url = base_url + query_parameter

Send an HTTP request to the URL using requests and store the response

        response = requests.get(url)

Create a soup object and parse the HTML page

        soup = bs(response.content, 'html.parser')

Find all the div elements with the class name "rvw-bd" and store them in a variable

        rev_div = soup.find_all("div", attrs={"class": "rvw-bd"})

Loop through all the rev_div elements and append the review text to the pagewise_reviews list

        for j in range(len(rev_div)):
            # take the first p tag in each div to fetch only the review text
            pagewise_reviews.append(rev_div[j].find("p").text)

Append all the page-wise reviews to a single list, "all_pages_reviews"

        for k in range(len(pagewise_reviews)):
            all_pages_reviews.append(pagewise_reviews[k])

At the end of the function, return the final list of reviews

    return all_pages_reviews

Call the function scraper() and store the output in a variable 'reviews'

# Driver code
reviews = scraper()
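
The scraper() function assumes that every request succeeds and that every review div contains a p tag. A slightly more defensive sketch is shown below. It is not the code from this article: the names fetch_html, extract_reviews and scrape_pages, the 10-second timeout, and the one-second pause between pages are all assumptions chosen for illustration.

```python
import time

import requests
from bs4 import BeautifulSoup

base_url = "https://www.consumeraffairs.com/food/dominos.html"


def fetch_html(url):
    """Download one page; raise an exception on HTTP errors such as 404 or 500."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return response.text


def extract_reviews(html):
    """Return the text of the first <p> in each div with class "rvw-bd"."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for div in soup.find_all("div", class_="rvw-bd"):
        paragraph = div.find("p")
        if paragraph is not None:  # skip review blocks that have no <p> tag
            reviews.append(paragraph.get_text(strip=True))
    return reviews


def scrape_pages(first_page, last_page, fetch=fetch_html, delay=1.0):
    """Collect reviews from a range of pages into one flat list."""
    all_reviews = []
    for page in range(first_page, last_page + 1):
        html = fetch(f"{base_url}?page={page}")
        all_reviews.extend(extract_reviews(html))
        time.sleep(delay)  # pause between requests; the delay is an assumption
    return all_reviews
```

Calling scrape_pages(1, 5) would request the same five pages as scraper(). Passing a different fetch function makes it possible to try the parsing logic on saved HTML without touching the network.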

Step 4: Store the data in the required format

4.1 Storing to a pandas DataFrame

i = range(1, len(reviews)+1)
reviews_df = pd.DataFrame({'review':reviews}, index=i)

Now let us take a glance at our dataset

print(reviews_df)

4.2 Writing the content of the data frame to a text file

reviews_df.to_csv('reviews.txt', sep='\t')
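
As a quick check that the file really is tab-separated, the sketch below writes a few made-up reviews and reads them back. The sample reviews and the temporary file location are assumptions; only the DataFrame layout and the reviews.txt file name follow the article.

```python
import os
import tempfile

import pandas as pd

# Made-up sample data standing in for the scraped reviews
reviews = ["The pizza arrived cold.", "Great service, fast delivery."]

# 1-based index, as in the article
reviews_df = pd.DataFrame({"review": reviews}, index=range(1, len(reviews) + 1))

# Write to a temporary directory so that no existing file is overwritten
out_path = os.path.join(tempfile.mkdtemp(), "reviews.txt")
reviews_df.to_csv(out_path, sep="\t")

# Read the file back, using the first column as the index
loaded_df = pd.read_csv(out_path, sep="\t", index_col=0)
print(loaded_df)
```

A tab separator is a reasonable choice for review text because commas are common inside the reviews themselves.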

With this, we are done with extracting the reviews and storing them in a text file. It's pretty simple, isn't it?

Complete Python Code:

# !pip install pandas requests BeautifulSoup4
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

base_url = "https://www.consumeraffairs.com/food/dominos.html"
all_pages_reviews = []

def scraper():
    for i in range(1, 6):  # fetching reviews from five pages
        pagewise_reviews = []
        query_parameter = "?page=" + str(i)
        url = base_url + query_parameter
        response = requests.get(url)
        soup = bs(response.content, 'html.parser')
        rev_div = soup.find_all("div", attrs={"class": "rvw-bd"})

        for j in range(len(rev_div)):
            # take the first p tag in each div to fetch only the review text
            pagewise_reviews.append(rev_div[j].find("p").text)

        for k in range(len(pagewise_reviews)):
            all_pages_reviews.append(pagewise_reviews[k])
    return all_pages_reviews

# Driver code
reviews = scraper()
i = range(1, len(reviews) + 1)
reviews_df = pd.DataFrame({'review': reviews}, index=i)
reviews_df.to_csv('reviews.txt', sep='\t')

End Notes:

In this article, we have learned the step-by-step process of extracting content from a web page and storing it in a text file:

  • inspect the target element using the browser's developer tools
  • use requests to download the HTML content
  • parse the HTML content using BeautifulSoup to extract required data

We can extend this example by scraping the usernames along with the review text, vectorizing the cleaned review text, and grouping users according to the reviews they have written. We can use Word2Vec or CountVectorizer to convert the text to vectors and then apply any machine learning clustering algorithm.
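
To illustrate what converting text to vectors means, here is a dependency-free sketch of count vectorization, the idea behind scikit-learn's CountVectorizer. It is a simplified stand-in and not the library's implementation: tokenization is a plain lowercase split on non-letters, and the sample reviews are made up.

```python
import re
from collections import Counter

# Made-up sample reviews for illustration
reviews = [
    "Cold pizza, late delivery",
    "Great pizza and great service",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Vocabulary: every distinct word, in sorted order, one column per word
vocabulary = sorted({word for review in reviews for word in tokenize(review)})

# One count vector per review
vectors = []
for review in reviews:
    counts = Counter(tokenize(review))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
# ['and', 'cold', 'delivery', 'great', 'late', 'pizza', 'service']
print(vectors)
# [[0, 1, 1, 0, 1, 1, 0], [1, 0, 0, 2, 0, 1, 1]]
```

Vectors like these could then be passed to a clustering algorithm to group similar reviews.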

I hope this blog helps you understand web scraping in Python using the BeautifulSoup library. Happy learning!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.

Hi, my name is Harika. I am a Data Engineer and I thrive on creating innovative solutions and improving user experiences. My passion lies in leveraging data to drive innovation and create meaningful impact.
