Web scraping gives people and businesses a way to unlock value from large amounts of data: with good analysis and research on data scraped from the web, you can challenge your competitors and even surpass them. It helps from an individual perspective too. If you are looking for a job, automated web scraping can collect every job posted on the internet into a spreadsheet, where you can filter listings by your skills and experience. Instead of spending hours gathering that information by hand, you can write a web scraping script that does all of that manual labour for you.
There is an enormous amount of information on the internet, and new data is generated every second, so scraping and researching it manually is not feasible. That’s why we need automated web scraping to accomplish our goals.
Web scraping has become an essential part of every business, individual, and even government.
Challenges
There are some challenges in the web scraping domain as well. Websites change continuously, so a web scraper that works today may not work tomorrow. That’s why people have come up with more diverse approaches such as Diffbot, a visual web scraping tool that combines computer vision, machine learning, and NLP to achieve a universal web scraping technique that is more powerful, accurate, and easy to use.
Another problem is variety: every website differs in design and coding structure, so we can’t use a single web scraping script everywhere to get results. The code has to keep changing as the websites change.
Today we are going to discuss some of the libraries that can reduce your web scraper building time and are essential for web scraping, as they are the building blocks on which everything else is built.
Urllib
Urllib is a package that bundles several modules for working with URLs; in simple words, it is an HTTP client for the Python programming language. Closely related is urllib3, a separate third-party HTTP client whose latest release at the time of writing is 1.26.2. It supports thread-safe connections, connection pooling, client-side SSL/TLS verification, multipart encoding, and gzip and brotli encoding, bringing many critical features that are missing from the Python standard library.
Urllib3 is one of the most widely downloaded packages on PyPI, it is usually among the first things a web scraping script pulls in, and it is available under the MIT license.
- urllib.request lets us simply open and read URLs.
- urllib.error defines the exceptions and errors raised by urllib.request.
- urllib.parse is used for parsing URLs.
- urllib.robotparser is used for parsing robots.txt files (a short sketch of these submodules follows below).
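To make these submodules concrete, here is a minimal, hedged sketch that exercises each of them against an example URL (the URL itself is only an illustration):

from urllib import error, parse, request, robotparser

# urllib.parse: split a URL into its components
parts = parse.urlparse('https://analyticsindiamag.com/?s=Web+Scraping')
print(parts.netloc, parts.query)

# urllib.robotparser: check whether a path may be crawled
rp = robotparser.RobotFileParser('https://analyticsindiamag.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://analyticsindiamag.com/'))

# urllib.request + urllib.error: open a URL and handle failures
try:
    with request.urlopen('https://analyticsindiamag.com/') as resp:
        html = resp.read()
except error.URLError as exc:
    print('Request failed:', exc)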
Installation
pip install urllib3
Alternatively, you can install it from the source code:
git clone git://github.com/urllib3/urllib3.git
python setup.py install
Quickstart
import urllib3

# PoolManager handles connection pooling and thread safety for us
http = urllib3.PoolManager()

# Send a GET request and inspect the response
r = http.request('GET', 'http://httpbin.org/robots.txt')
r.status
r.data
Output: the status code (200) followed by the raw bytes of the robots.txt file.
Let’s scrape a website using urllib and regular expressions
#1 libraries needed
import urllib.request
import urllib.parse
import re

#2 search parameters
url = 'https://analyticsindiamag.com/'
values = {'s': 'Web Scraping',
          'submit': 'search'}

#3 encode the parameters, send the request and read the response
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
resp = urllib.request.urlopen(req)
respData = resp.read()

#4 extract paragraph text using regular expressions
document = re.findall(r'<p>(.*?)</p>', str(respData))
for line in document:
    print(line)
We can easily fetch data without using any other module; just urllib and re (regular expression)
Let’s understand the code explained above:
- First, we imported the required modules, i.e. re and urllib
- Defined a URL, i.e. Analytics India Magazine, and some test search values we want to look for.
- In the first line of section #3, we URL-encode the search values, and then encode the result to UTF-8 bytes so it can be sent over the network.
- In the third line, we build a request against the URL we defined earlier, carrying that data.
- Next, urlopen() is used to open the HTML document.
- read() is used to read the contents of that document.
- We use Python’s re module to find values with regular expressions. In this case, our regular expression captures all the data that sits inside paragraph tags.
Instead of the paragraph tag, we could use a span tag in the findall regular expression to extract all the article titles, as we did in this BeautifulSoup tutorial, but now with just the help of the two lightweight modules urllib and re.
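As a rough illustration of that idea, the only change is the pattern passed to findall; the exact span markup is an assumption about the page, so treat this as a sketch rather than a guaranteed extractor:

# Hypothetical variation of step #4: capture text inside <span> tags instead of <p> tags
titles = re.findall(r'<span>(.*?)</span>', str(respData))
for title in titles:
    print(title)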
Requests
Requests is an open-source Python library that makes HTTP requests more human-friendly and simpler to use. It was developed by Kenneth Reitz, Cory Benfield, Ian Stapleton Cordasco, and Nate Prewitt, with an initial release in February 2011.
The requests library is written in Python and licensed under Apache2.
Sounds pretty much the same as urllib, then why do we need it?
Because requests supports a fully RESTful API, and it is easier to use and access.
Even though the requests library is powered by urllib3, it is used more nowadays because of its readability, its POST/GET freedom, and much more.
Also, the urllib API is thoroughly broken; it was built for a different time and a different web structure, and urllib requires more work than requests for even the simplest task. So we need a more flexible HTTP client, i.e. requests.
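To see the difference on a trivial task, here is a hedged side-by-side sketch of the same GET request written both ways (httpbin.org is used purely as an example endpoint):

# A plain GET with the standard-library urllib
import urllib.request
with urllib.request.urlopen('http://httpbin.org/get') as resp:
    body = resp.read().decode('utf-8')

# The same GET with requests
import requests
body = requests.get('http://httpbin.org/get').text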
Advantages:
- The requests library is easy to use for fetching information
- It is extensively used for scraping data from websites
- It is also used for making web API requests
- Using requests, we can GET, POST, PUT, and DELETE data for a given URL
- It has built-in support for authentication (see the short sketch after this list)
- It handles cookies and sessions very reliably.
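As a quick, hedged illustration of the authentication and session support mentioned above (the URL and credentials are placeholders, not a real service):

import requests

# A Session keeps cookies across requests and can carry default settings
session = requests.Session()
session.auth = ('user', 'passwd')  # placeholder HTTP basic auth credentials
session.headers.update({'User-Agent': 'my-scraper'})

# Cookies set by the first response are reused automatically in later calls
resp = session.get('http://www.yourwebsite.com/login')
profile = session.get('http://www.yourwebsite.com/user')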
Features:
- International Domains and URLs access capabilities.
- SSL verification
- JSON decoder
- .netrc support
- Multiple file uploads
- Thread safety
- Unicode response body
Installation
pip install requests
Quickstart
import requests

# Each HTTP verb maps to a module-level function that returns a Response object
resp = requests.get('http://www.yourwebsite.com/user')
resp = requests.post('http://www.yourwebsite.com/user')
resp = requests.put('http://www.yourwebsite.com/user/put')
resp = requests.delete('http://www.yourwebsite.com/user/delete')
You don’t need to encode parameters the way we did with urllib earlier; just pass a dictionary as an argument, and you are good to go:
attributes = {"firstname": "John", "lastname": "Edison", "password": "jeddie123"}
resp = requests.post('http://www.yourwebsite.com/user', data=attributes)
It also has its own JSON decoder:
resp.json()
Or, if the response is plain text, use:
resp.text
Web scraping with ‘requests’
We use requests together with Beautiful Soup for processing and finding information, or we can use regular expressions as shown in the urllib demonstration above.
For this demonstration, we are using requests with Beautiful Soup, and we are scraping the articles from a website.
#1 importing modules
import requests
from bs4 import BeautifulSoup

#2 fetch the page with a GET request
res = requests.get('https://analyticsindiamag.com/')

#3 Beautiful Soup for extracting only the data we need
soup = BeautifulSoup(res.text, 'html.parser')
article_block = soup.find_all('div', class_='post-title')
for titles in article_block:
    title = titles.find('span').get_text()
    print(title)
Explanation
- Imported requests and Beautiful Soup
- requests.get() performs an HTTP request to the given URL. It returns the HTML data
- Beautiful Soup parses that data with its HTML parser, and then further operations like find_all are executed for the class post-title (an equivalent CSS-selector version is sketched below); you can find more about BeautifulSoup here
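For reference, the same extraction can usually be expressed with Beautiful Soup’s CSS-selector interface; a minimal sketch, assuming the same post-title markup as above:

# Equivalent extraction using a CSS selector instead of find_all/find
for span in soup.select('div.post-title span'):
    print(span.get_text())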
Use case of requests other than web scraping
We can use the requests module to call a web API and get answers. In this case, we are sending a POST to the web API: https://loan5.herokuapp.com/api
This API is used to predict loan approval. It returns 1 or 0, i.e. approved or not approved, when passed attributes like gender, credit history, married, etc.
#1 importing modules
import json
import requests

url = 'https://loan5.herokuapp.com/api'

#2 sample data
data = {'Gender': 1, 'Married': 1, 'Dependents': 2, 'Education': 0,
        'Self_Employed': 1, 'Credit_History': 0, 'Property_Area': 1, 'Income': 1}
data = json.dumps(data)

#3 send the data to the web API with a POST request; it returns the answer
send_req = requests.post(url, data)
print(send_req.json())
Conclusion
We have learned how two Python modules, urllib and requests, can help with web scraping from scratch. There are many ways to build your web scraper: in a previous article we used Selenium for web scraping, then we combined Selenium with Beautiful Soup, and now we have integrated the requests module with Beautiful Soup instead of Selenium.
It all depends on your use case: if your scraper needs HTTP and web API communication, you should start fetching your URLs with requests; whereas if you want a scraper that interacts with pages in real time while scraping, you can adopt Selenium.