
Essentials of Web Scraping: urllib & Requests with Python


Web scraping gives people and businesses a way to act on a fair amount of data: you can challenge your competitors, and even surpass them, simply by doing good analysis and research on data scraped from the web. It helps individuals too. If you are looking for a job, automated web scraping can collect every posting on the internet into a spreadsheet, where you can filter them by your skills and experience. Instead of spending hours gathering that information by hand, you can write a web scraping script that does all of that manual labor for you.

There is so much information on the internet, and new data is generated every second, so manual scraping and research are simply not feasible. That is why we need automated web scraping to accomplish our goals.

Web scraping has become an essential tool for businesses, individuals, and even governments.

Challenges

There are some challenges in the web scraping domain, too. Websites change continuously, so a scraper that works today may not work tomorrow. That is why people have come up with more adaptive approaches such as Diffbot, a visual web scraping tool that combines computer vision, machine learning, and NLP to achieve a universal scraping technique that is more powerful, accurate, and easy to use.

Another problem in web scraping is variety: all websites differ in design and coding structure, so a single scraping script cannot be reused everywhere. The code must change continuously as the websites change.

Today we are going to discuss some libraries that can reduce your scraper-building time and are essential for web scraping, as they are the building blocks on which everything else is built.

Urllib

Urllib is a package that combines several modules for working with URLs; in simple words, it is an HTTP client in Python's standard library. A related third-party package is urllib3, whose latest release at the time of writing is 1.26.2. It supports thread-safe connection pooling, client-side SSL/TLS verification, multipart encoding, and gzip and brotli decoding, bringing many critical features that are missing from the traditional Python libraries.

Urllib3 is one of the most widely downloaded packages on PyPI, often among the first imports in a web scraping script, and it is available under the MIT license.

urllib modules

  • urllib.request opens and reads URLs.
  • urllib.error defines the exceptions raised by urllib.request.
  • urllib.parse parses URLs.
  • urllib.robotparser parses robots.txt files.
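
A minimal sketch of how two of these modules are used in practice; the URL is just for illustration:

import urllib.parse
import urllib.robotparser

# split a URL into its components with urllib.parse
parts = urllib.parse.urlparse('https://analyticsindiamag.com/?s=Web+Scraping')
print(parts.scheme, parts.netloc, parts.query)   # https analyticsindiamag.com s=Web+Scraping

# check whether robots.txt allows fetching a page with urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://analyticsindiamag.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://analyticsindiamag.com/'))   # True if scraping is allowed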

Installation

pip install urllib3

Alternatively, you can install it from source:

git clone git://github.com/urllib3/urllib3.git
python setup.py install

Quickstart

import urllib3

# create a pool manager, which handles connection pooling and thread safety
http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/robots.txt')
print(r.status)   # HTTP status code
print(r.data)     # raw response body as bytes

Output

The HTTP status code (200 on success), followed by the raw bytes of the robots.txt file.
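
urllib3 can also encode query parameters and set request headers for you; a minimal sketch (the header value is just for illustration):

import urllib3

http = urllib3.PoolManager()
# fields= is encoded into the query string of a GET request
r = http.request('GET', 'http://httpbin.org/get',
                 fields={'s': 'Web Scraping'},
                 headers={'User-Agent': 'my-scraper/0.1'})
print(r.status)
print(r.data.decode('utf-8'))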

Let’s scrape a website using urllib and regular expressions

#1 libraries needed
import urllib.request
import urllib.parse
import re

#2 search values to submit
url = 'https://analyticsindiamag.com/'
values = {'s': 'Web Scraping',
          'submit': 'search'}

#3 encode the values and request the page
data = urllib.parse.urlencode(values)    # build the query string
data = data.encode('utf-8')              # encode it to bytes
req = urllib.request.Request(url, data)  # passing data makes this a POST request
resp = urllib.request.urlopen(req)
respData = resp.read().decode('utf-8')

#4 extract every paragraph using regular expressions
document = re.findall(r'<p>(.*?)</p>', respData)

for line in document:
    print(line)
Output: the text of every paragraph on the page.

We can easily fetch data without using any other module; just urllib and re (regular expressions).

Let’s understand the code explained above:

  1. First, we imported the required modules, re and urllib.
  2. We defined a URL (Analytics India Magazine) and some test search values we want to submit.
  3. In the first line we encoded the search values into a query string, and then we encoded that string to bytes so it can be sent over the wire.
    • In the third line, we built a request for the URL with the data we encoded earlier.
    • Next, urlopen() opened the HTML document.
    • read() read that document into a string.
  4. Finally, we used Python's regular expression module to find values. In this case, our pattern scrapes all the data inside paragraph tags.

We can use a span tag in the findall pattern instead, to extract all the article titles as we did in this BeautifulSoup tutorial, but now with just the help of the two lightest modules, urllib and re.
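
A quick sketch, reusing respData from the script above (this assumes the titles sit in plain span tags, which may change with the site's markup):

# extract article titles instead of paragraphs
titles = re.findall(r'<span>(.*?)</span>', respData)
for title in titles:
    print(title)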

Requests

Requests is an open-source Python library that makes HTTP requests more human-friendly and simple to use. It was developed by Kenneth Reitz, Cory Benfield, Ian Stapleton Cordasco, and Nate Prewitt, with an initial release in February 2011.

The requests library is Apache2-licensed and written in Python.


It sounds pretty much the same as urllib, so why do we need it?

Because requests supports a fully RESTful API and is easier to use and access.

Even though the requests library is powered by urllib3, it is used more nowadays because of its readability, its simple GET/POST handling, and much more.

Also, the urllib API is thoroughly broken: it was built for a different time and a different web, and it requires more work than requests for even the simplest task. So we need a more flexible HTTP client, i.e. requests.
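
To make the difference concrete, here is the same GET request with query parameters in both libraries (httpbin.org is used only as a test URL):

import urllib.parse
import urllib.request
import requests

url = 'http://httpbin.org/get'
params = {'s': 'Web Scraping'}

# urllib: encode the query string and build the URL by hand
full_url = url + '?' + urllib.parse.urlencode(params)
with urllib.request.urlopen(full_url) as resp:
    body = resp.read().decode('utf-8')

# requests: pass the parameters as a dictionary and let the library do the rest
body = requests.get(url, params=params).text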

Advantages:

  • The requests library is easy to use for fetching information.
  • It is extensively used for scraping data from websites.
  • It is also used for web API requests.
  • Using requests we can GET, POST, PUT, and DELETE data for a given URL.
  • It has authentication module support.
  • It handles cookies and sessions very cleanly, as the sketch below shows.
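
A minimal sketch of sessions and authentication; the URL and credentials below are placeholders:

import requests

session = requests.Session()
session.auth = ('user', 'passwd')    # HTTP basic auth applied to every request in the session
resp = session.get('http://www.yourwebsite.com/user')
print(resp.status_code)
print(session.cookies.get_dict())    # cookies set by the server persist across requests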

Features:

  • Support for international domains and URLs
  • SSL verification
  • JSON decoder
  • .netrc support
  • Multiple file uploads
  • Thread safety
  • Unicode response body

Installation

pip install requests

Quickstart

import requests

resp = requests.get('http://www.yourwebsite.com/user')
resp = requests.post('http://www.yourwebsite.com/user')
resp = requests.put('http://www.yourwebsite.com/user/put')
resp = requests.delete('http://www.yourwebsite.com/user/delete')

You don’t need to encode parameters by hand as you do with urllib; just pass a dictionary as an argument, and you are good to go:

attributes = {"firstname": "John", "lastname": "Edison", "password": "jeddie123"} 
resp = requests.post('http://www.yourwebsite.com/user', data=attributes)
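
The same goes for query parameters on a GET request; requests builds the query string from a dictionary (placeholder URL again):

payload = {'s': 'Web Scraping', 'submit': 'search'}
resp = requests.get('http://www.yourwebsite.com/search', params=payload)
print(resp.url)   # the encoded query string is appended automatically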

It also has its own JSON decoder:

resp.json()

Or, if the response is text, use:

resp.text
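
The response object also exposes the status code and headers, and can raise an exception on HTTP errors; a short sketch:

resp = requests.get('http://www.yourwebsite.com/user')
print(resp.status_code)              # e.g. 200
print(resp.headers['Content-Type'])  # response headers behave like a dictionary
resp.raise_for_status()              # raises requests.HTTPError for 4xx/5xx responses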

Web scraping with ‘requests’

We can use requests with Beautiful Soup to process and find information, or we can use regular expressions as shown in the urllib demonstration above.

For this demonstration, we are using requests with Beautiful Soup, and we are scraping the articles from a website.

#1 importing modules
import requests
from bs4 import BeautifulSoup

#2 fetching the page with GET
res = requests.get('https://analyticsindiamag.com/')

#3 Beautiful Soup for extracting only the data we need
soup = BeautifulSoup(res.text, 'html.parser')
article_block = soup.find_all('div', class_='post-title')
for titles in article_block:
	title = titles.find('span').get_text()
	print(title)
Output: the titles of the articles on the page.

Explanation

  1. We imported requests and Beautiful Soup.
  2. requests.get() performs an HTTP GET request to the given URL and returns the HTML data.
  3. Beautiful Soup parses that data using its HTML parser, and then find_all() collects every div with the class post-title; you can find more about BeautifulSoup here.

Use case of requests other than web scraping

We can use the requests module to call a web API and get answers back. In this case, we are using POST on the web API: https://loan5.herokuapp.com/api

This API is used to predict loan approval. It returns 1 or 0, i.e. approved or rejected, when passed attributes like gender, credit history, married, etc.

#1 importing modules
import json
import requests
url = 'https://loan5.herokuapp.com/api'

#2 sample data
data = {'Gender': 1, 'Married': 1, 'Dependents': 2, 'Education': 0,
        'Self_Employed': 1, 'Credit_History': 0, 'Property_Area': 1, 'Income': 1}
data = json.dumps(data)   # serialize the dictionary to a JSON string

#3 send the request with the data to the web API; it returns the answer
send_req = requests.post(url, data)
print(send_req.json())
Output: 1 or 0, the predicted loan approval.
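
requests can also serialize the payload itself through the json= keyword, which sends the dictionary as a JSON body; whether this particular API accepts that content type is an assumption:

# equivalent sketch using requests' built-in JSON serialization
send_req = requests.post(url, json={'Gender': 1, 'Married': 1, 'Dependents': 2,
                                    'Education': 0, 'Self_Employed': 1,
                                    'Credit_History': 0, 'Property_Area': 1,
                                    'Income': 1})
print(send_req.json())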

Conclusion

We have learned how two Python modules, urllib and requests, can help with web scraping from scratch. There are many ways to build a web scraper: in a previous article we used Selenium for web scraping, then we combined Selenium with Beautiful Soup, and now we have integrated the requests module with Beautiful Soup instead of Selenium.

It all depends on your use case. If your scraper needs HTTP and web API communication, you should start fetching your URLs with requests; if you want a scraper that interacts with the page in real time while scraping, you can adopt Selenium.
