Given a URL, the task is to extract key components such as the protocol, hostname, port number, and path using regular expressions (Regex) in Python. For example:
Input: https://www.geeksforgeeks.org/courses
Output:
Protocol: https
Hostname: geeksforgeeks.org
Let’s explore different methods to parse and process a URL in Python using Regex.
The re.findall() method returns all non-overlapping matches of the given pattern as a list. It scans the entire string from left to right and extracts every substring that matches the regular expression.
['https'] ['geeksforgeeks.org']
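A minimal sketch that reproduces the output above, assuming the example URL from the introduction. The variable names and the exact patterns are illustrative choices, not the only way to write them:

```python
import re

url = "https://www.geeksforgeeks.org/courses"

# Protocol: the word characters immediately before "://"
protocol = re.findall(r"(\w+)://", url)

# Hostname: word characters, dots, and hyphens after "://www."
# (assumes the host begins with "www."; other hosts yield an empty list)
hostname = re.findall(r"://www\.([\w.-]+)", url)

print(protocol, hostname)
```

Because the hostname pattern hard-codes the `www.` prefix, a URL without that prefix would need a different pattern.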
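Explanation: each pattern contains a single capturing group, so re.findall() returns only the captured text: the protocol that precedes '://' and the hostname with its leading 'www.' left out.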
When a URL contains an optional port number, we can extend the regex with an optional group marked by the '?' quantifier. The port is then captured when it is present, and the pattern still matches when it is absent.
['file']
['localhost']
[('localhost', ':4040', '4040')]
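A minimal sketch that produces the three lines of output above. The input URL `file://localhost:4040/abc_file` is an assumption chosen to fit that output, and the variable names are illustrative:

```python
import re

# Assumed example URL with an explicit port
url = "file://localhost:4040/abc_file"

# Protocol: word characters before "://"
protocol = re.findall(r"(\w+)://", url)

# Hostname: stops at ":" or "/" because neither is in the character class
hostname = re.findall(r"://([\w.-]+)", url)

# Hostname plus an optional ":port" group; the inner group holds the digits
host_port = re.findall(r"://([\w.-]+)(:(\d+))?", url)

print(protocol)
print(hostname)
print(host_port)
```

If the URL has no port, the optional groups do not participate and re.findall() fills them with empty strings, so `file://localhost/abc_file` would give `[('localhost', '', '')]`. Writing the outer group as non-capturing, `(?::(\d+))?`, would drop the `':4040'` element from the tuple.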
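Explanation: the first two results are the protocol and the hostname. In the third, re.findall() returns one tuple element per capturing group: the hostname, the optional ':port' group including its colon, and the port digits alone.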
This approach extracts the protocol, domain, file name, and file extension in a single pattern. It’s useful for structured URLs where each part follows a predictable pattern.
[('http', 'www.example.com', 'index', 'html')]
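A minimal sketch that yields the tuple above. The input URL `http://www.example.com/index.html` is an assumption inferred from that output, and the names are illustrative:

```python
import re

# Assumed example URL with a single "name.extension" path segment
url = "http://www.example.com/index.html"

# Four capturing groups: protocol, domain, file name, file extension
pattern = r"(\w+)://([\w.-]+)/(\w+)\.(\w+)"

parts = re.findall(pattern, url)
print(parts)
```

This pattern only handles a path made of one `name.extension` segment. Nested directories or query strings would need a broader pattern.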
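Explanation: the pattern has four capturing groups, so re.findall() returns one tuple per match holding the protocol, domain, file name, and file extension in that order.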