A Python library for automating website interaction and scraping! But what exactly does MechanicalSoup offer that we didn't already cover with Beautiful Soup?
MechanicalSoup is a Python package that automatically stores and sends cookies, follows redirects, and can follow hyperlinks and submit forms on a webpage. It was created by M Hickford, who had always admired the Mechanize library. Mechanize, a John J. Lee project that enables programmatic web browsing in Python, was taken over by Kovid Goyal in 2017.
Some of the features of Mechanize were:
- mechanize.Browser: uses urllib2.OpenerDirector to open any URL on the internet
- Easily filling HTML forms
- Automatically observing robots.txt
- Automatically handling HTTP-EQUIV headers
- Browser .back() and .reload() methods
Unfortunately, Mechanize is not compatible with Python 3, and its development stalled for many years.
So Hickford came up with a solution: MechanicalSoup, which provides a similar API, built on Python Requests (for better HTTP sessions) and BeautifulSoup (for document navigation). Since 2017 the project has been maintained by Hemberger and moy.
MechanicalSoup is designed to mimic how humans interact with web browsers. Some possible use-cases include:
- Interacting with websites that don't provide an API
- Testing a website you're currently developing
- Acting as a lightweight browsing interface
Installation
You can install MechanicalSoup with pip from PyPI (the Python Package Index). The install command will also download its dependencies, such as BeautifulSoup and requests.
pip install MechanicalSoup
Or, you can download and install the development version from GitHub:
pip install git+https://github.com/MechanicalSoup/MechanicalSoup
Or, install from source (installs the version in the current working directory):
git clone https://github.com/MechanicalSoup/MechanicalSoup.git
cd MechanicalSoup
python setup.py install
Note: git must be installed to use the above command.
We will also install one additional library:
pip install wget
Quickstart
Let's test that the library imports and works without errors:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
url = "http://httpbin.org/"
browser.open(url)
print(browser.get_url())
MechanicalSoup functions explained:
- First, we imported the mechanicalsoup library.
- mechanicalsoup.StatefulBrowser() creates a browser object. It is an extension of Browser that stores the browser's state.
- browser.open() opens the webpage we want in the background and returns a <Response [200]> object, the return value of open(). MechanicalSoup uses the requests library to make HTTP requests, which is why we get such a return value.
- browser.get_url() returns the current URL; it, too, relies on the requests framework.
- Furthermore, we can do other things with MechanicalSoup, like following links whose text or URL matches a pattern:
browser.follow_link("forms")
browser.get_url()
Passing the string "forms" to follow_link() treats it as a regular expression and follows the first link that matches it, and get_url() then returns the new URL.
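The matching logic behind follow_link() can be illustrated with plain Python. This is a minimal sketch, not MechanicalSoup's actual implementation, and the list of links is hypothetical, modeled on the httpbin.org front page:

```python
import re

# Hypothetical (link text, href) pairs as they might appear on a page.
links = [
    ("HTML form that posts to /post", "/forms/post"),
    ("Redirects", "/redirect/1"),
]

# follow_link("forms") behaves roughly like a regex search over each link:
# the first link whose text or URL matches the pattern is followed.
pattern = re.compile("forms")
matched = next(
    href for text, href in links
    if pattern.search(text) or pattern.search(href)
)
print(matched)  # → /forms/post
```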
- Now we are on a new page, http://httpbin.org/forms/post. Let's extract the page content:
browser.get_current_page()
It returns the current page's source, much like BeautifulSoup's prettify() output, because get_current_page() returns a bs4.BeautifulSoup object. MechanicalSoup uses Beautiful Soup for data extraction.
- You can find any tag by using the following command:
browser.get_current_page().find_all('legend')
find_all() navigates to every occurrence of the given tag, in this case "legend", and excludes the rest of the source code.
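To see what "collect every occurrence of a tag" means without a live page, here is a small sketch using only the standard library's html.parser. The HTML fragment is hypothetical, modeled on the fieldset legends of the httpbin.org/forms/post page; in the tutorial itself, browser.get_current_page().find_all('legend') does this in one call:

```python
from html.parser import HTMLParser

class LegendCollector(HTMLParser):
    """Collect the text inside every <legend> tag, ignoring the rest."""
    def __init__(self):
        super().__init__()
        self.in_legend = False
        self.legends = []

    def handle_starttag(self, tag, attrs):
        if tag == "legend":
            self.in_legend = True
            self.legends.append("")

    def handle_endtag(self, tag):
        if tag == "legend":
            self.in_legend = False

    def handle_data(self, data):
        if self.in_legend:
            self.legends[-1] += data

# Hypothetical fragment modeled on the httpbin.org form page.
html = ("<form><fieldset><legend>Pizza Size</legend></fieldset>"
        "<fieldset><legend>Pizza Toppings</legend></fieldset></form>")
parser = LegendCollector()
parser.feed(html)
print(parser.legends)  # → ['Pizza Size', 'Pizza Toppings']
```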
- We can also fill in forms and send POST requests with MechanicalSoup:
browser.select_form('form[action="/post"]')
browser.get_current_form().print_summary()
- Here select_form() takes a CSS selector; we selected the <form> tag whose action attribute is "/post".
- print_summary() prints all the available form fields that we can fill in before making the POST request.
- For filling the form we can use the following commands:
browser["custname"] = "Mohit"
browser["custtel"] = "9081515151"
browser["custemail"] = "mohitmaithani@aol.com"
browser["comments"] = "please make pizza dough more soft"
browser["size"] = "large"
browser["topping"] = "mushroom"
#launch browser
browser.launch_browser()
- browser["fieldname"] = "text" fills in a form field.
- browser.launch_browser() opens the current page state in a real web browser so you can inspect the result.
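When the form is finally submitted, the filled-in fields travel as an ordinary URL-encoded POST body. A sketch of roughly what that payload looks like, built with the standard library (the field names come from the httpbin.org/forms/post page used above):

```python
from urllib.parse import urlencode

# The same values we assigned via browser["..."] = "..." above.
form_data = {
    "custname": "Mohit",
    "custtel": "9081515151",
    "custemail": "mohitmaithani@aol.com",
    "comments": "please make pizza dough more soft",
    "size": "large",
    "topping": "mushroom",
}

# urlencode produces the key=value&key=value body of the POST request,
# escaping spaces and special characters along the way.
body = urlencode(form_data)
print(body)
```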
We have now covered all the essential MechanicalSoup functions; if you need more information about the API, refer to the official documentation.
Let's scrape cat images from the internet using MechanicalSoup and create our own custom dataset!
It's a good use-case. The very first step of every data science project is to create or collect data; processing, cleaning, analysis, modeling, and tuning come afterwards. Now that we are familiar with the essential API, let's jump straight into the code:
- Search for cats on Google Images
We set up the Google Images search query and open it in the browser with the search text "cat".
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
url = "https://www.google.com/imghp?hl=en"
browser.open(url)
#get HTML
browser.get_current_page()
#target the search input
browser.select_form()
browser.get_current_form().print_summary()
#search for a term
search_term = 'cat'
browser["q"] = search_term
#submit/"click" search
browser.launch_browser()
response = browser.submit_selected()
print('new url:', browser.get_url())
print('response:\n', response.text[:500])
- Navigate to the new page and target all the images; this will return the output as a list of URLs.
#open URL
new_url = browser.get_url()
browser.open(new_url)
#get HTML code
page = browser.get_current_page()
all_images = page.find_all('img')
#target the source attributes of image
image_source = []
for image in all_images:
    image = image.get('src')
    image_source.append(image)

image_source[5:25]
- Let’s fix the corrupted URLs
We use Python's startswith method to keep only the URLs that begin with https:
#save cleaned links in "image_source"
image_source = [image for image in image_source if image and image.startswith('https')]
print(image_source)
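startswith('https') is a quick filter, but it would also accept a malformed string such as "httpsfoo", and <img> tags without a src attribute yield None. A slightly stricter sketch using the standard library's urlparse (the candidate URLs below are hypothetical, for illustration only):

```python
from urllib.parse import urlparse

# Hypothetical mix of values that find_all('img') + .get('src') can produce.
candidates = [
    "https://example.com/cat1.jpg",
    "/images/branding/logo.png",        # relative URL
    "data:image/gif;base64,R0lGOD",     # inline data URI
    None,                               # <img> with no src attribute
]

def is_https_url(value):
    """Keep only well-formed absolute https URLs."""
    if not value:
        return False
    parts = urlparse(value)
    return parts.scheme == "https" and bool(parts.netloc)

clean = [u for u in candidates if is_https_url(u)]
print(clean)  # → ['https://example.com/cat1.jpg']
```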
- Create a local directory to store the cat images.
import os
path = os.getcwd()
path = os.path.join(path, search_term + "s")
#create the directory (exist_ok avoids an error if it already exists)
os.makedirs(path, exist_ok=True)
#print the path where the cat images will be saved
path
- Download the images using wget
##install wget by uncommenting the line below
#pip install wget
##download images
import wget

counter = 0
for image in image_source:
    save_as = os.path.join(path, search_term + str(counter) + '.jpg')
    wget.download(image, save_as)
    counter += 1
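If you would rather avoid the extra wget dependency, the same loop can be written with the standard library alone. This is a sketch under the assumption that image_source holds direct image URLs; the save_paths helper and the example URLs are hypothetical:

```python
import os
from urllib.request import urlretrieve  # stdlib alternative to the wget package

def save_paths(directory, term, urls):
    """Build the numbered destination paths: cat0.jpg, cat1.jpg, ..."""
    return [os.path.join(directory, f"{term}{i}.jpg") for i in range(len(urls))]

# Hypothetical URLs standing in for the scraped image_source list.
urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
for url, dest in zip(urls, save_paths("cats", "cat", urls)):
    # urlretrieve(url, dest)  # uncomment to actually download each image
    print(dest)
```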
Output:
A full dataset of cat images on our local computer, ready for further data science prediction/analysis.
You can find the .ipynb (Python notebook) here. It contains all the code used in this MechanicalSoup tutorial.
Did you notice that MechanicalSoup is a composition of Requests and BeautifulSoup, while also offering Selenium-like real-browser viewing through launch_browser()?
Final Thoughts
Indeed, MechanicalSoup is a powerful, multipurpose automation and web scraping tool. As we have seen, Mechanize is the base library that inspired MechanicalSoup. MechanicalSoup is a composition of BeautifulSoup and Requests, and it can also act a bit like Selenium, showing results in a web browser, but with lighter processing.
MechanicalSoup is actively maintained, and it is a popular and easy-to-use framework.
Projects using the MechanicalSoup framework
- PatZilla: a modular patent information research platform and data integration toolkit.
- gmusicapi: an unofficial library for Google Music.
- Chamilotools: a set of tools to interact with a Chamilo server without using your browser.