URL: https://www.analyticsvidhya.com/blog/2021/08/a-simple-introduction-to-web-scraping-with-beautiful-soup/

A Simple Introduction to Web Scraping with Beautiful Soup

Eugenia Last Updated : 13 Aug, 2021
5 min read
This article was published as a part of the Data Science Blogathon

๐Ÿ‘ web scraping beautiful soup | html
Illustration by Author

Disclaimer: The goal of this post is only educational. Web Scraping is not encouraged, especially when there are terms and conditions against such actions.

This post is the fourth in a series of tutorials on building scrapers. The full series is listed below:

  1. HTML basics for web scraping
  2. Web Scraping with Octoparse
  3. Web Scraping with Selenium
  4. Web Scraping with Beautiful Soup (this post)

The purpose of this series is to learn how to extract data from websites. Most of the data on websites is in HTML format, so the first tutorial explains the basics of this markup language. The second guide shows an easy way to scrape data with an intuitive web scraping tool that doesn't require any knowledge of HTML. The last two tutorials, instead, focus on gathering data from the web with Python. In this case, you interact directly with HTML pages, so you need some prior knowledge of HTML.

Web scraping is the process of collecting data from a web page and storing it in a structured format, such as a CSV file. For example, if you want to predict the ratings of Amazon product reviews, you could be interested in gathering information about that product from the official website.

You aren't allowed to scrape data from every website. I recommend you first look at the robots.txt file to avoid legal implications. You only have to add '/robots.txt' at the end of the site's root URL to check which sections of the website are allowed or disallowed.
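The robots.txt check can also be done from Python. Below is a minimal sketch using the standard library's urllib.robotparser. The rules in sample_robots_txt and the example.com URLs are made up for illustration; they are not the rules of any real website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch(user_agent, url) tells whether the rules permit fetching the URL
allowed_public = parser.can_fetch("*", "https://example.com/wiki/Some_page")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")

print(allowed_public)
print(allowed_private)

# For a live site, the rules could be downloaded instead of pasted in:
# parser.set_url("https://en.wikipedia.org/robots.txt")
# parser.read()
```

Keep in mind that robots.txt only states the site's crawling preferences. The terms and conditions of the website still apply.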

As an example, I am going to parse a web page using two Python libraries, Requests and Beautiful Soup. The list of countries by greenhouse gas emissions will be extracted from Wikipedia as in the previous tutorials of the series.

Table of Contents:

  1. Import libraries
  2. Create response object
  3. Create a Beautiful Soup object
  4. Explore HTML tree
  5. Extract elements of the table

1. Import libraries

The first step of the tutorial is to check if all the required libraries are installed:

!pip install beautifulsoup4
!pip install requests
!pip install pandas

Once the installation is complete, let's import the libraries:

from bs4 import BeautifulSoup 
import requests
import pandas as pd

Beautiful Soup is a library for extracting data from HTML and XML files. It builds a parse tree for the parsed page, since an HTML document is composed of a tree of tags. Here is an example of HTML code to illustrate this concept.

<!DOCTYPE html>
<html>
<head>
<title>Tutorial of Web scraping</title>
</head>
<body>
<h1>1. Import libraries</h1>
<p>Let's import: </p>
</body>
</html>
๐Ÿ‘ example html code | web scraping beautiful soup
Illustration by Author

Since the HTML has a tree structure, there are also ancestors, descendants, parents, children and siblings.
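These relationships can be explored directly in code. The sketch below parses the small HTML example from this section and walks up, down, and sideways in the tree. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
<html>
<head>
<title>Tutorial of Web scraping</title>
</head>
<body>
<h1>1. Import libraries</h1>
<p>Let's import: </p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# Parent: the tag that directly contains <h1>
parent_name = h1.parent.name

# Ancestors: every container of <h1>, from the closest one outwards
ancestor_names = [tag.name for tag in h1.parents]

# Children: the tags directly inside <body>
# (recursive=False stops the search from going deeper)
child_names = [tag.name for tag in soup.body.find_all(True, recursive=False)]

# Descendants: every tag nested anywhere inside <html>
descendant_names = [tag.name for tag in soup.html.find_all(True)]

# Sibling: the next tag at the same level as <h1>
next_sibling_name = h1.find_next_sibling().name

print(parent_name)
print(ancestor_names)
print(child_names)
print(descendant_names)
print(next_sibling_name)
```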

2. Create Response Object

To get the web page, the first step is to create a response object, passing the URL to the get method.

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_greenhouse_gas_emissions'
req = requests.get(url)
print(req)
# <Response [200]>
๐Ÿ‘ request response objects | web scraping beautiful soup
Request-Response Protocol. Illustration by Author.

This operation can seem mysterious, but with a simple image, I show how it works. The client communicates with the server using the HyperText Transfer Protocol (HTTP). This line of code does the same thing as typing the link into the address bar: the client sends the request to the server, and the server reads the request, performs the requested action, and sends back a response.
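To watch one request-response cycle without touching a real website, here is a small sketch that uses only the Python standard library. A throwaway local server stands in for Wikipedia, and http.client plays the role that requests.get plays in the tutorial. DemoHandler and PAGE are hypothetical names made up for this demo.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page served by the throwaway local server
PAGE = b"<html><body><h1>Hello from the server</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the same small HTML page."""

    def do_GET(self):
        self.send_response(200)  # status line: 200 OK
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)  # response body

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Server side: port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: send a GET request and read the response
conn = HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/")
response = conn.getresponse()
status = response.status
content_type = response.getheader("Content-Type")
body = response.read().decode("utf-8")
conn.close()

server.shutdown()
server.server_close()

print(status, content_type)
print(body)
```

The status code printed here is the same number that appears inside the response object returned by requests.get.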

3. Create a Beautiful Soup object

Let's create the Beautiful Soup object, which parses the document using the HTML parser. In this way, we transform the HTML code into a tree of Python objects, as I showed before in the illustration.

soup = BeautifulSoup(req.text,"html.parser")
print(soup)
๐Ÿ‘ print soup object |web scraping beautiful soup

If you print the object, you'll see all the HTML code of the web page.

4. Explore HTML tree

As you can observe, this tree contains many tags, which hold different types of information. We can access the tags directly by writing:

soup.head
soup.body
soup.body.h1
# <h1 class="firstHeading" id="firstHeading">List of countries by greenhouse gas emissions</h1>
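One detail worth knowing about this dotted access is that it returns only the first tag with that name, and None when the tag doesn't exist. A short sketch on a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet with two <p> tags and no <h2> tag
html = "<html><body><h1>First</h1><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.body.p      # only the first <p> is returned
missing = soup.body.h2     # there is no <h2>, so this is None

first_p_text = first_p.get_text()
heading_text = str(soup.h1.string)

print(first_p_text)
print(missing)
print(heading_text)
```

Checking for None before calling methods on the result avoids an AttributeError when a page doesn't contain the expected tag.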

A more flexible way is to use the find and find_all methods, which filter the document: find returns the first matching element, while find_all returns all the matching elements.

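# Assumption: tab is the first "wikitable" table (its definition is missing here)
tab = soup.find('table', class_='wikitable')
row1 = tab.find('tr')
print(row1)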
๐Ÿ‘ row 1

Using the find method, we zoom in on a part of the document within the <tr> tags, which are used to build each row of the table. In this case, we got only the first row because the function extracts only one element. Instead, if we want to gather all the rows of the table, we use the other method:

rows = tab.find_all('tr')
print(len(rows))
print(rows[0])

We obtained a list with 187 elements. If we show the first item, we'll see the same output as before. The find_all method is useful when we need to zoom in on several parts of the document that share the same tag.
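The difference between the two methods is easier to see on a tiny table. The sketch below uses a made-up three-row table (the countries and values are placeholders, not real emissions data) and assumes the table carries a "wikitable" class.

```python
from bs4 import BeautifulSoup

# Made-up table: one header row and two body rows
html = """
<table class="wikitable">
  <tr><th>Country</th><th>Emissions</th></tr>
  <tr><th>A</th><td>10</td></tr>
  <tr><th>B</th><td>20</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tab = soup.find("table", class_="wikitable")

row1 = tab.find("tr")        # first matching element only
rows = tab.find_all("tr")    # list of all matching elements
n_rows = len(rows)
same_first_row = rows[0] is row1

# When nothing matches, find returns None and find_all returns an empty list
no_match = tab.find("caption")
empty = tab.find_all("caption")

print(n_rows)
print(same_first_row)
print(no_match, len(empty))
```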

5. Extract elements of the table

To store all the elements, we create a dictionary whose keys are the names of the columns and whose values are empty lists.

rows = tab.find_all('tr')
cols = [t.text.rstrip() for t in rows[0].find_all('th')]
diz = {c:[] for c in cols}
print(diz)

The first row of the table contains only the headers, while the rest of the rows constitute the body. To see the HTML code of a specific element, place the mouse pointer over it, right-click, and select "Inspect".

So, we iterate over the rows of the table, excluding the first:

for r in rows[1:]:
    # The first column of each row sits in a <th> tag
    diz[cols[0]].append(r.find('th').text.replace('\xa0', '').rstrip())
    # The remaining columns sit in <td> tags
    row_other = r.find_all('td')
    for idx, c in enumerate(row_other):
        cell_text = c.text.replace('\xa0', '').rstrip()
        diz[cols[idx+1]].append(cell_text)

The first column is always contained within the <th> tags, while the other columns are within the <td> tags. To avoid having "\n" and "\xa0" in the text, we use the rstrip and replace functions, respectively.
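To see what these two string methods do, here is a small sketch on made-up cell values that contain a non-breaking space ("\xa0") and a trailing newline ("\n"):

```python
# Made-up raw cell texts, similar in shape to what .text can return
raw_cells = ["\xa0Country A\n", "1,234.5\n", "Country B"]

# replace drops the non-breaking spaces, rstrip drops trailing whitespace
clean_cells = [cell.replace("\xa0", "").rstrip() for cell in raw_cells]

print(clean_cells)
# ['Country A', '1,234.5', 'Country B']
```

Note that replace removes the non-breaking space wherever it appears, so it would also join two words separated only by "\xa0".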

In this way, we extract all the data contained in the table and save it into a dictionary. Now, we can transform the dictionary into a pandas DataFrame and export it into a CSV file:

df = pd.DataFrame(diz)
df.head()
df.to_csv('tableghg.csv')
๐Ÿ‘ csv file

Finally, we can have an overview of the table obtained. Isn't it amazing? And I didn't write many lines of code.
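As a small aside on the export step, the sketch below builds a DataFrame from a made-up dictionary shaped like diz and writes it as CSV text. Passing index=False is an optional choice, not part of the original code: it leaves out the extra column of row numbers that to_csv writes by default.

```python
import pandas as pd

# Made-up dictionary with the same shape as diz: column name -> list of values
toy_diz = {"Country": ["A", "B"], "Emissions": ["10", "20"]}

# All the lists must have the same length, otherwise pandas raises a ValueError
df_toy = pd.DataFrame(toy_diz)

# Without a file path, to_csv returns the CSV content as a string
csv_text = df_toy.to_csv(index=False)

print(df_toy.shape)
print(csv_text)
```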

Final thoughts

I hope you found this tutorial useful. Beautiful Soup can be the right tool for you when the project is small. On the other hand, if you have to deal with more complex items in a web page, such as JavaScript elements, you should opt for another scraper, Selenium. In that case, it's better to check the third tutorial of the series. Thanks for reading. Have a nice day!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.

Responses From Readers

Charles Brauer

You can stop writing articles about Screen Scraping. Those days are over. Yahoo has found a way to block screen scrapers. For example, url = 'https://finance.yahoo.com/quote/AA/profile?p=AA' returns a response of 404. Charles

Also, a great way of screen scraping is to delegate your scraping task to third-party supplier like e-scraper.com
