When scraping data from websites, we often face different types of errors. Some are caused by incorrect URLs, server issues or incorrect usage of scraping libraries like requests and BeautifulSoup.
In this tutorial, we’ll explore some common exceptions encountered during web scraping and how to handle them.
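An HTTPError occurs when the server responds with an HTTP error status code, such as 404 (Not Found) or 500 (Internal Server Error). With requests, the exception is not raised automatically: it is raised when response.raise_for_status() is called on a response that carries a 4xx or 5xx status code.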
Output:
Request successful
Explanation: The URL exists and the server returns a success status code, so no exception is raised.
Output:
HTTP Error: 404 Client Error: Not Found for url: https://www.geeksforgeeks.org/page-that-does-not-exist/
Explanation: This URL does not exist, so a 404 Not Found error is raised.
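The two outputs above can be reproduced with a short sketch along the following lines. It is a minimal illustration rather than the original code: the helper names check_status and fetch_status are hypothetical, and the 10-second timeout is an assumed default.

```python
import requests


def check_status(response):
    """Return a short message describing whether the response succeeded."""
    try:
        # raise_for_status() raises HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.HTTPError as err:
        return f"HTTP Error: {err}"
    return "Request successful"


def fetch_status(url, timeout=10):
    """Request the URL and report its status (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    return check_status(response)
```

Calling print(fetch_status(url)) with a URL that exists should print "Request successful", while a missing page should produce a message starting with "HTTP Error: 404".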
URLError typically occurs when the URL is invalid, or there’s a network connection issue.
Note: URLError belongs to Python's built-in urllib module (urllib.error.URLError). The requests module does not raise it. Instead, requests.exceptions.ConnectionError is raised for connection failures.
Example:
Output:
Connection Error: HTTPSConnectionPool(host='thiswebsitedoesnotexist123456789.com', port=443): Max retries exceeded with url: / (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x783ef8211210>: Failed to resolve 'thiswebsitedoesnotexist123456789.com' ([Errno -2] Name or service not known)"))
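Explanation: The domain does not exist, so the hostname cannot be resolved and requests raises a ConnectionError.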
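A minimal sketch of handling connection failures with requests might look like this. The helper name fetch and the 5-second timeout are assumptions made for illustration. The final except clause relies on RequestException being the base class of the exceptions requests raises, so it also covers malformed URLs.

```python
import requests


def fetch(url, timeout=5):
    """Return the page text, or None if the request could not be completed."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.ConnectionError as err:
        # DNS failure, refused connection, or no network route
        print(f"Connection Error: {err}")
    except requests.exceptions.Timeout as err:
        print(f"Timeout: {err}")
    except requests.exceptions.RequestException as err:
        # Any other requests error, e.g. an invalid URL or an HTTP error status
        print(f"Request failed: {err}")
    else:
        return response.text
    return None
```

Returning None on failure lets the calling code skip the page and continue scraping instead of crashing.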
An AttributeError is raised when code references an attribute that does not exist on an object. With BeautifulSoup, this usually happens when we look up a tag that is not present on the page: the lookup itself returns None rather than raising an error, and the AttributeError appears as soon as we access an attribute or call a method on that None result.
Example:
Output:
AttributeError: 'NoneType' object has no attribute 'SomeTag'
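Explanation: The requested tag is not present, so the lookup returns None, and accessing SomeTag on None raises the AttributeError.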
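The sketch below shows one way to trigger and then avoid this error. It parses an inline HTML string instead of a live page so that it runs offline. The sample markup and the helper name tag_text are made up for illustration.

```python
from bs4 import BeautifulSoup

HTML = "<html><body><h1>Title</h1></body></html>"

soup = BeautifulSoup(HTML, "html.parser")

# find() returns None for a missing tag, so calling a method on it fails
try:
    soup.find("nav").get_text()
except AttributeError as err:
    message = f"AttributeError: {err}"
    print(message)


def tag_text(soup, name):
    """Return the text of the first matching tag, or None if it is absent."""
    tag = soup.find(name)
    if tag is None:
        return None
    return tag.get_text(strip=True)


print(tag_text(soup, "h1"))
print(tag_text(soup, "nav"))
```

Checking for None before using the result is usually cleaner than wrapping every lookup in try/except.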
When parsing invalid or incomplete XML data with BeautifulSoup, you might face parsing errors or get None or empty results when using find() or find_all().
soup = bs4.BeautifulSoup(response, 'xml')
or
soup = bs4.BeautifulSoup(response, 'lxml-xml')
These problems usually appear when the element name passed to find() or find_all() is wrong, or when the element is missing from the document. In that case find() returns None and find_all() returns an empty list [].
Example:
Output:
None
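As a hedged illustration of the behaviour described above, the sketch below parses a small inline XML string and looks up an element that is not there. The sample document and the helper name make_soup are hypothetical. The "xml" parser requires the third-party lxml package, so the sketch falls back to html.parser when lxml is not installed. That fallback is only a convenience for this example and is not a general substitute for a real XML parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound

XML_DOC = "<catalog><item><name>Widget</name></item></catalog>"


def make_soup(markup):
    """Parse with the XML parser, falling back to html.parser if lxml is missing."""
    try:
        return BeautifulSoup(markup, "xml")
    except FeatureNotFound:
        # "xml" needs lxml; this fallback exists only to keep the sketch runnable
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(XML_DOC)

name = soup.find("name")
price = soup.find("price")        # element is not in the document
prices = soup.find_all("price")   # no matches

# Guard against None before reading the text
name_text = name.get_text() if name is not None else "N/A"
price_text = price.get_text() if price is not None else "N/A"

print(price)
print(prices)
print(name_text, price_text)
```

Supplying a default such as "N/A" keeps the scraper running when an optional element is absent from some documents.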