
A Deep Dive Into Web Scraping Using MechanicalSoup


A Python library for automating website interaction and scraping! But what exactly does MechanicalSoup offer that we didn't already cover with Beautiful Soup?

MechanicalSoup is a Python package that automatically stores and sends cookies, follows redirects, and can follow hyperlinks and submit forms on a webpage. It was created by M Hickford, who had always admired the Mechanize library. Mechanize, a John J. Lee project that enabled programmatic web browsing in Python, was later taken over by Kovid Goyal in 2017.

Some of the features of Mechanize were:

  • mechanize.Browser, which uses urllib2.OpenerDirector to open any URL on the internet
  • Easy HTML form filling
  • Automatic observance of robots.txt
  • Automatic handling of HTTP-EQUIV headers
  • Browser .back() and .reload() methods

Unfortunately, Mechanize was not compatible with Python 3, and its development stalled for many years.

So Hickford came up with a solution: MechanicalSoup, which provides a similar API, built on Python Requests (for better HTTP sessions) and BeautifulSoup (for data navigation). Since 2017, the project has been maintained by Hemberger and moy.
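That composition is visible in the API itself. Here is a minimal sketch, assuming a recent MechanicalSoup release (where StatefulBrowser exposes the underlying requests session via its session attribute):

import mechanicalsoup
import requests
import bs4

browser = mechanicalsoup.StatefulBrowser()
browser.open("http://httpbin.org/")

#the HTTP layer is a plain requests.Session
print(isinstance(browser.session, requests.Session))              # True
#the parsed page is a plain bs4.BeautifulSoup object
print(isinstance(browser.get_current_page(), bs4.BeautifulSoup))  # True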

MechanicalSoup is designed to mimic how humans interact with web browsers. Some possible use-cases include:

  • Interacting with websites that don't provide an API
  • Testing a website you're currently developing
  • Acting as a lightweight browsing interface

Installation

You can install MechanicalSoup from PyPI using pip (the Python package manager).

The install command will also download BeautifulSoup, requests, six, and the other libraries MechanicalSoup depends on.

pip install MechanicalSoup

Or, you can download and install the development version from GitHub:

pip install git+https://github.com/MechanicalSoup/MechanicalSoup

Or, install from source (installs the version in the current working directory):

git clone https://github.com/MechanicalSoup/MechanicalSoup.git
cd MechanicalSoup
python setup.py install

Note: git must be installed to use the above command.
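To verify that the installation worked, you can import the package and print its version (recent MechanicalSoup releases expose a __version__ attribute):

import mechanicalsoup
print(mechanicalsoup.__version__)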

We will also install one additional library, wget, which we will use later to download image files:

pip install wget

Quickstart

Let's test that the library imports and runs without errors:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
url = "http://httpbin.org/"

browser.open(url)
print(browser.get_url())

MechanicalSoup functions explained:

  • First, we import the mechanicalsoup library.
  • mechanicalsoup.StatefulBrowser() creates a browser object. StatefulBrowser is an extension of Browser that stores the browser's state.
  • browser.open() opens the webpage in the background and returns <Response [200]>, the return value of open(). MechanicalSoup uses the requests library to make HTTP requests, so it is no surprise that we get a requests-style response object.
  • browser.get_url() returns the current URL; it, too, relies on the requests framework.
  • Furthermore, we can do other things with MechanicalSoup, such as following links on the page:
browser.follow_link("forms")
browser.get_url()

Passing the regular expression "forms" to follow_link() follows the first link whose text matches the expression (here, forms), and get_url() then returns the new URL.

browser.get_current_page()

It returns the current page's source, displayed much like Beautiful Soup's prettify() output, because get_current_page() returns a bs4.BeautifulSoup object. MechanicalSoup uses Beautiful Soup for data extraction.
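Since the page is an ordinary BeautifulSoup object, all the familiar Beautiful Soup methods work on it directly. A small sketch:

page = browser.get_current_page()
print(type(page))       # <class 'bs4.BeautifulSoup'>
print(page.title)       # the page's <title> tag, if present
print(page.find('h1'))  # the first <h1> tag, if present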

  • You can find any tag by using the following command:
browser.get_current_page().find_all('legend')

find_all() navigates through the page for the given tag, in this case "legend", and returns only the matching tags, excluding the rest of the source code.

  • We can also fill in forms and send POST requests with MechanicalSoup by using the following commands:
browser.select_form('form[action="/post"]')
browser.get_current_form().print_summary()
  • select_form() takes a CSS selector; here we selected the form tag whose action attribute is /post.
  • print_summary() prints all the available form fields we can fill in before making the POST request.
  • For filling the form we can use the following commands:
browser["custname"] = "Mohit"
browser["custtel"] = "9081515151"
browser["custemail"] = "mohitmaithani@aol.com"
browser["comments"] = "please make pizza dough more soft"
browser["size"] = "large"
browser["topping"] = "mushroom"
 
#launch browser
browser.launch_browser()
  • browser["fieldname"] = "text" fills in the corresponding form field.
  • browser.launch_browser() shows the real-time result in a local browser tab.
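Note that launch_browser() only displays the current state of the form locally. To actually send the POST request to the server, call submit_selected(); a minimal sketch (httpbin.org echoes back the submitted fields):

response = browser.submit_selected()
print(response.status_code)  # 200 on success
print(response.text[:300])   # the echoed form data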

We have now covered all the essential MechanicalSoup functions. If you need more information about the API, consult the official documentation.

Let's scrape cat images from the internet using MechanicalSoup and create our own custom dataset!

It's a good use-case. The very first step of every data science project is to create or collect data; processing, cleaning, analysis, modeling, and tuning come afterwards. Now that we are familiar with the essential API, let's jump straight into the code:

  1. Search for cats on Google Images

We set up the Google search query and open it in the browser with the search text cat.

import mechanicalsoup
 
browser = mechanicalsoup.StatefulBrowser()
url = "https://www.google.com/imghp?hl=en"
 
browser.open(url)
 
#get HTML
browser.get_current_page()
 
#target the search input
browser.select_form()
browser.get_current_form().print_summary()
 
#search for a term
search_term = 'cat'
browser["q"] = search_term 
 
#submit/"click" search
browser.launch_browser()
response = browser.submit_selected()
 
print('new url:', browser.get_url())
print('response:\n', response.text[:500])
  2. Navigate to the new page and target all the images; this will return the output as a list of URLs.
#open URL
new_url = browser.get_url()
browser.open(new_url)
 
#get HTML code
page = browser.get_current_page()
all_images = page.find_all('img')
 
#target the source attributes of image
image_source = []
for image in all_images:
    image = image.get('src')
    image_source.append(image)
 
image_source[5:25]
  3. Let's fix the corrupted URLs

We use Python's startswith() string method to keep only the URLs that begin with "https" (some entries may be base64 data URIs or missing src attributes):

#save cleaned links in "image_source" (skip None and non-https entries)
image_source = [image for image in image_source if image and image.startswith('https')]

print(image_source)
  4. Create a local directory to store the cat images.
import os
 
path = os.getcwd()
path = os.path.join(path, search_term + "s")
 
#create the directory (no error if it already exists)
os.makedirs(path, exist_ok=True)
#print the path where the cat images will be saved
print(path)
  5. Download the images using wget
##install wget by uncommenting the line below
#pip install wget

import wget

##download images, skipping any link that fails
counter = 0
for image in image_source:
    save_as = os.path.join(path, search_term + str(counter) + '.jpg')
    try:
        wget.download(image, save_as)
    except Exception as error:
        print('skipping', image, error)
    counter += 1

Output:

A full dataset of cat images on our local computer, ready for further data science prediction/analysis.


You can find the .ipynb (Python notebook) here. It contains all the code used in this MechanicalSoup tutorial.

Did you notice that MechanicalSoup is a composition of Requests and BeautifulSoup, while also offering Selenium-like real-time browser viewing?

Final Thoughts

Indeed, MechanicalSoup is a powerful, multipurpose automation and web scraping tool. As we have seen, Mechanize is the base library on which MechanicalSoup was built. MechanicalSoup is a composition of BeautifulSoup and Requests, and it can also act like Selenium, showing results in a web browser, but with much lighter processing.

MechanicalSoup is actively maintained by its developers, and it is a very popular and easy-to-use framework.

Projects using the MechanicalSoup framework

  • PatZilla: a modular patent information research platform and data integration toolkit.
  • gmusicapi: an unofficial library for Google Music.
  • Chamilotools: a set of tools to interact with a Chamilo server without using your browser.
