
URL: https://www.geeksforgeeks.org/web-scraping/how-to-build-web-scraping-bot-in-python/

How to Build Web scraping bot in Python - GeeksforGeeks



How to Build Web scraping bot in Python

Last Updated : 23 Jul, 2025

In this article, we are going to see how to build a web scraping bot in Python.

Web scraping is the process of extracting data from websites. A bot is a piece of code that automates a task. A web scraping bot is therefore a program that automatically scrapes a website for data, based on our requirements.

Modules needed

  • bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. It is not part of the Python standard library. To install it, run the command below in the terminal.

pip install bs4

  • requests: Requests lets you send HTTP/1.1 requests with very little code. It is also not part of the Python standard library. To install it, run the command below in the terminal.

pip install requests

  • Selenium: Selenium is one of the most popular automation testing tools. It can be used to automate browsers like Chrome, Firefox, Safari, etc.

pip install selenium

Method 1: Using Selenium

To automate Chrome with Selenium, a Chrome driver must be installed. Our task is to create a bot that continuously scrapes the Google News website and displays all the headlines every 10 minutes.

Stepwise implementation:

Step 1: First we will import some required modules.

Step 2: The next step is to open the required website.
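
Steps 1 and 2 can be sketched as follows. This is a minimal sketch, assuming Selenium 4 and the Google News front page at https://news.google.com/ (the exact address is an assumption). The names NEWS_URL, open_page and driver_factory are illustrative and not part of Selenium.

```python
NEWS_URL = "https://news.google.com/"  # assumed address of the Google News front page


def open_page(url=NEWS_URL, driver_factory=None):
    """Start a browser session, load `url`, and return the driver."""
    if driver_factory is None:
        # Imported here so Selenium is only required when a real browser is used.
        from selenium import webdriver

        # Selenium 4.6+ can locate a matching Chrome driver by itself.
        driver_factory = webdriver.Chrome
    driver = driver_factory()
    driver.get(url)  # navigate the browser to the page
    return driver
```

Calling open_page() with no arguments opens a Chrome window on the news page. Call driver.quit() on the returned driver when you are done.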

Step 3: Extract the news title from the webpage. To extract a specific part of the page, we need its XPath, which can be found by right-clicking the required element and selecting Inspect from the context menu.

After clicking Inspect, the developer tools panel appears. From there, copy the element's full XPath to access it.

Note: Inspecting does not always land on the exact element you want (it depends on the structure of the website), so you may have to browse the HTML for a while to find it. Then copy that path and paste it into your code. After running the code up to this point, the title of the first headline is printed on your terminal.
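
A minimal sketch of this lookup, assuming driver is a Selenium WebDriver that has already loaded the page. The XPath is the first of the headline paths listed in Step 4. FIRST_HEADLINE_XPATH and first_headline are illustrative names.

```python
# Full XPath copied from the browser's Inspect panel (first headline path from Step 4).
FIRST_HEADLINE_XPATH = (
    "/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]"
    "/div[3]/div/div/article/h3/a"
)


def first_headline(driver, xpath=FIRST_HEADLINE_XPATH):
    """Return the visible text of the element located at `xpath`."""
    # The string "xpath" is the value of Selenium's By.XPATH constant.
    return driver.find_element("xpath", xpath).text
```

Printing first_headline(driver) shows the headline text. Selenium raises NoSuchElementException if the page layout has changed and the path no longer matches.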

Output:

'Attack on Afghan territory': Taliban on US airstrike that killed 2 ISIS-K men

Step 4: Now, the target is to get the XPaths of all the headlines present.

One way is to copy the XPath of every headline (Google News usually shows about 6 headlines) and fetch each one, but that approach does not scale when there are many items to scrape. A more elegant way is to find the pattern in the headline XPaths, which makes the task easier and more efficient. Below are the XPaths of all the headlines on the website; let's figure out the pattern.

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[3]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[4]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[5]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[6]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[7]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[8]/div/div/article/h3/a

Comparing these XPaths, only one index changes: the div immediately after c-wiz/div[1], which runs from div[3] to div[8]. Based on this pattern, we can generate the XPaths of all the headlines and read every title from the page by its XPath.
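
The pattern can be turned into code as follows. This is a sketch, assuming the six paths listed above and a Selenium WebDriver passed in as driver. XPATH_TEMPLATE, headline_xpaths and read_headlines are illustrative names.

```python
# Everything is fixed except the index of one div, marked {index}.
XPATH_TEMPLATE = (
    "/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]"
    "/div[{index}]/div/div/article/h3/a"
)


def headline_xpaths(first=3, last=8):
    """Build one XPath per headline for div[first] through div[last]."""
    return [XPATH_TEMPLATE.format(index=i) for i in range(first, last + 1)]


def read_headlines(driver, xpaths):
    """Return the text of every headline that exists; skip missing ones."""
    headlines = []
    for xpath in xpaths:
        # find_elements returns an empty list instead of raising when nothing matches.
        found = driver.find_elements("xpath", xpath)
        if found:
            headlines.append(found[0].text)
    return headlines
```

Using find_elements rather than find_element lets the bot tolerate a page that shows fewer headlines than expected.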

The code is now almost complete. The last requirement is that the bot fetches the headlines every 10 minutes, so we run a while loop and sleep for 10 minutes after getting all the headlines.

Below is the full implementation:
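
One way to assemble the steps into a complete bot is sketched here. It assumes Selenium 4, the Google News front page at https://news.google.com/ (an assumed address), and the XPath pattern found in Step 4. The cycles and sleep parameters are illustrative additions that make the loop easy to stop or test.

```python
import time

NEWS_URL = "https://news.google.com/"  # assumed front-page address
XPATH_TEMPLATE = (
    "/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]"
    "/div[{index}]/div/div/article/h3/a"
)


def scrape_headlines(driver, first=3, last=8):
    """Return the text of each headline found at div[first]..div[last]."""
    headlines = []
    for index in range(first, last + 1):
        # "xpath" is the value of Selenium's By.XPATH constant.
        found = driver.find_elements("xpath", XPATH_TEMPLATE.format(index=index))
        if found:
            headlines.append(found[0].text)
    return headlines


def run_bot(driver, interval_seconds=600, cycles=None, sleep=time.sleep):
    """Reload the page, print the headlines, wait, and repeat.

    cycles=None loops forever; a number stops after that many passes.
    """
    completed = 0
    while cycles is None or completed < cycles:
        driver.get(NEWS_URL)  # reload so each pass sees fresh headlines
        for headline in scrape_headlines(driver):
            print(headline)
        completed += 1
        if cycles is None or completed < cycles:
            sleep(interval_seconds)  # 600 seconds = 10 minutes
    return completed
```

With Selenium installed, importing webdriver from selenium and calling run_bot(webdriver.Chrome()) starts the bot. Stop it with Ctrl+C.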

Method 2: Using Requests and BeautifulSoup

The requests module fetches the raw HTML from a website, and Beautiful Soup parses that HTML to pull out exactly the data we need. Unlike Selenium, this method needs no browser or driver, so it is lighter: it sends HTTP requests directly instead of driving a browser.

Stepwise implementation:

Step 1: Import module.

Step 2: Next, fetch the page at the target URL and parse its HTML.
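
Steps 1 and 2 can be sketched as below. The target URL is not given here, so the caller supplies it. fetch_html and parse_html are illustrative helper names, and the 10-second timeout is an assumption.

```python
def fetch_html(url, session=None, timeout=10):
    """Download `url` and return the response body as text."""
    if session is None:
        # Imported here so requests is only needed for real network calls.
        import requests

        session = requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return response.text


def parse_html(html):
    """Parse an HTML string using Python's built-in parser backend."""
    from bs4 import BeautifulSoup

    return BeautifulSoup(html, "html.parser")
```

A typical call is parse_html(fetch_html(url)), which yields a BeautifulSoup object ready for searching.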

Step 3: Get all the headings from the table.
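
A minimal sketch of reading the headings, assuming soup is a BeautifulSoup object for the page and that the data sits in the first table element with its headings in th cells. table_headings is an illustrative name.

```python
def table_headings(soup):
    """Return the text of every <th> cell in the first <table> on the page."""
    table = soup.find("table")  # first table in document order, or None
    if table is None:
        return []
    # get_text(strip=True) removes surrounding whitespace and newlines.
    return [th.get_text(strip=True) for th in table.find_all("th")]
```

If the page holds several tables, narrowing the search with an id or class passed to soup.find would be needed. That detail depends on the target site.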

Step 4: In the same way, all the values in each row can be obtained.
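
The row values can be collected with a similar sketch, again assuming soup is a BeautifulSoup object and the data rows use td cells inside tr rows. table_rows is an illustrative name.

```python
def table_rows(soup):
    """Return a list of rows, each a list of the row's <td> cell texts."""
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # a header row has only <th> cells, so it is skipped
            rows.append(cells)
    return rows
```

Each inner list lines up with the column headings, so the two can later be paired by position.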

Below is the full implementation:
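
A sketch of how the pieces might fit together, assuming the data lives in the first table on the page. The URL is supplied by the caller because the target page is not named here. scrape_table and scrape are illustrative names.

```python
def scrape_table(soup):
    """Return one dict per data row, keyed by the table's column headings."""
    table = soup.find("table")
    if table is None:
        return []
    headings = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip the header row, which has no <td> cells
            records.append(dict(zip(headings, cells)))
    return records


def scrape(url, timeout=10):
    """Fetch `url`, parse it, and return the table as a list of dicts."""
    # Imported here so the parsing helper above works without network libraries.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return scrape_table(BeautifulSoup(response.text, "html.parser"))
```

Looping over scrape(url) and printing each record gives one line per table row. Note that zip stops at the shorter sequence, so a row with fewer cells than headings simply yields fewer keys.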

Hosting the Bot

This section shows one specific way to run the bot continuously online without any human intervention. replit.com is an online compiler where we will run the code. We will create a mini web server with the Flask module in Python, which helps keep the code running continuously. Create an account on that website and create a new repl.

After creating the repl, create two files: one to run the bot code and the other to create the web server using Flask.

Code for cryptotracker.py:
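
A hedged sketch of what the bot file could contain: it optionally starts the web server, then scrapes on a fixed interval. run_tracker, scrape_once and start_server are illustrative names. scrape_once is any function that returns the rows to print, and start_server is the function that launches the Flask server.

```python
import time


def run_tracker(scrape_once, start_server=None, interval_seconds=600,
                cycles=None, sleep=time.sleep):
    """Start the keep-alive server if given, then scrape repeatedly.

    cycles=None runs forever; a number stops after that many passes.
    """
    if start_server is not None:
        start_server()  # must return quickly, e.g. by starting a background thread
    completed = 0
    while cycles is None or completed < cycles:
        for record in scrape_once():
            print(record)
        completed += 1
        if cycles is None or completed < cycles:
            sleep(interval_seconds)  # 600 seconds = 10 minutes
    return completed
```

On Replit, the call would pass the scraping function and the server starter, leaving cycles unset so the loop never ends.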

Code for the keep_alive.py (webserver):
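
A minimal sketch of such a server, assuming Flask is installed. The route text, host 0.0.0.0 and port 8080 are assumptions, and create_app, run and keep_alive are illustrative names.

```python
from threading import Thread


def create_app():
    """Build a tiny Flask app with a single page that a monitor can ping."""
    from flask import Flask  # imported here so Flask is only needed when serving

    app = Flask(__name__)

    @app.route("/")
    def home():
        return "Bot is running"

    return app


def run(host="0.0.0.0", port=8080):
    """Serve the app; this call blocks until the server stops."""
    create_app().run(host=host, port=port)


def keep_alive(target=run):
    """Run the server in a background thread so the bot loop can continue."""
    thread = Thread(target=target, daemon=True)  # daemon: exits with the main program
    thread.start()
    return thread
```

The bot file calls keep_alive() once before entering its scraping loop. The server thread then answers pings while the main thread keeps scraping.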

Keep-alive is a networking technique for preventing a connection from being dropped. Here, the purpose of the keep-alive code is to create a web server using Flask that keeps the crypto tracker code active so that it can give updates continuously.

Now that the web server is created, we need something to ping it continuously so that the server does not go down and the code keeps running. The website uptimerobot.com does this job. Create an account on it.

Run the crypto tracker code in Replit. We have now created a web scraping bot that scrapes the chosen website every 10 minutes and prints the data to the terminal.
