Python Modules for Web Scraping

In this chapter, let’s learn about various Python modules that can be used for web scraping.

Python Development Environment Using Virtualenv

Virtualenv is a tool for creating isolated Python environments. With virtualenv, we can create a folder containing all the necessary executables to use the packages our Python projects require. It also allows us to add and modify Python modules without accessing the global installation.

You can install virtualenv using the following command −

(base) D:\ProgramData>pip install virtualenv
Collecting virtualenv
   Downloading
https://files.pythonhosted.org/packages/b6/30/96a02b2287098b23b875bc8c2f58071c3
5d2efe84f747b64d523721dc2b5/virtualenv-16.0.0-py2.py3-none-any.whl
(1.9MB)
   100% |████████████████████████████████| 1.9MB 86kB/s
Installing collected packages: virtualenv
Successfully installed virtualenv-16.0.0

Now, we need to create a directory that will represent the project with the help of the following command.

(base) D:\ProgramData>mkdir webscrap

Now, move into that directory with the help of the following command:

(base) D:\ProgramData>cd webscrap

Now, we need to initialize the virtual environment folder of our choice as shown below.

(base) D:\ProgramData\webscrap>virtualenv websc
Using base prefix 'd:\programdata'
New python executable in D:\ProgramData\webscrap\websc\Scripts\python.exe
Installing setuptools, pip, wheel...done.

Now, activate the virtual environment with the following command. Once activated, its name will appear in parentheses at the left-hand side of the prompt.

(base) D:\ProgramData\webscrap>websc\scripts\activate

We can install any module in this environment as shown below.

(websc) (base) D:\ProgramData\webscrap>pip install requests
Collecting requests
   Downloading
https://files.pythonhosted.org/packages/65/47/7e02164a2a3db50ed6d8a6ab1d6d60b69
c4c3fdf57a284257925dfc12bda/requests-2.19.1-py2.py3-none-any.whl (91kB)
   100% |████████████████████████████████| 92kB 148kB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
   Downloading
https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca
55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)
   100% |████████████████████████████████| 143kB 369kB/s
Collecting certifi>=2017.4.17 (from requests)
   Downloading
https://files.pythonhosted.org/packages/df/f7/04fee6ac349e915b82171f8e23cee6364
4d83663b34c539f7a09aed18f9e/certifi-2018.8.24-py2.py3-none-any.whl (147kB)
   100% |████████████████████████████████| 153kB 527kB/s
Collecting urllib3<1.24,>=1.21.1 (from requests)
   Downloading
https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c5
3851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl (133kB)
   100% |████████████████████████████████| 143kB 517kB/s
Collecting idna<2.8,>=2.5 (from requests)
   Downloading
https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746
a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
   100% |████████████████████████████████| 61kB 339kB/s
Installing collected packages: chardet, certifi, urllib3, idna, requests
Successfully installed certifi-2018.8.24 chardet-3.0.4 idna-2.7 requests-2.19.1
urllib3-1.23

To deactivate the virtual environment, we can use the following command −

(websc) (base) D:\ProgramData\webscrap>deactivate
(base) D:\ProgramData\webscrap>

You can see that (websc) has been deactivated.
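As a side note, Python 3.3 and later also ship a built-in venv module that does the same job as virtualenv without any extra install. The following is a minimal sketch of creating an environment programmatically; the folder name websc2 is just an example, not part of the project above.

```python
import venv
from pathlib import Path

# Create an isolated environment in ./websc2.
# with_pip=False skips bootstrapping pip, which makes creation much faster.
target = Path("websc2")
venv.EnvBuilder(with_pip=False).create(target)

# Every venv contains a pyvenv.cfg file recording the base interpreter.
print((target / "pyvenv.cfg").exists())
```

On the command line, the equivalent is simply `python -m venv websc2`.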

Python Modules for Web Scraping

Web scraping is the process of building an agent that automatically downloads, parses, and organizes useful information from the web. In other words, instead of us manually saving data from websites, web scraping software automatically loads and extracts data from multiple websites according to our requirements.

In this section, we will discuss useful Python libraries for web scraping.

Requests

It is a simple Python web scraping library. It’s an efficient HTTP library for accessing web pages. With Requests, we can retrieve the raw HTML of a web page and then parse it to retrieve data. Before using Requests, let’s learn how to install it.

Installing Requests

We can install it in our virtual environment or globally. With the help of the pip command, we can easily install it as shown below.

(base) D:\ProgramData>pip install requests
Collecting requests
Using cached
https://files.pythonhosted.org/packages/65/47/7e02164a2a3db50ed6d8a6ab1d6d60b69
c4c3fdf57a284257925dfc12bda/requests-2.19.1-py2.py3-none-any.whl
Requirement already satisfied: idna<2.8,>=2.5 in d:\programdata\lib\site-packages
(from requests) (2.6)
Requirement already satisfied: urllib3<1.24,>=1.21.1 in
d:\programdata\lib\site-packages (from requests) (1.22)
Requirement already satisfied: certifi>=2017.4.17 in d:\programdata\lib\site-packages
(from requests) (2018.1.18)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in
d:\programdata\lib\site-packages (from requests) (3.0.4)
Installing collected packages: requests
Successfully installed requests-2.19.1

Example

In this example, we are performing a GET HTTP request to a web page. To do this, we need to first import the requests library as shown below.

In [1]: import requests

In the following line of code, we use requests to make a GET HTTP request for the URL https://authoraditiagarwal.com/.

In [2]: r = requests.get('https://authoraditiagarwal.com/')

Now we can retrieve the content by using the .text property as shown below.

In [5]: r.text[:200]

Notice that in the output below, we get the first 200 characters.

Out[5]: '<!DOCTYPE html>\n<html lang="en-US"\n\titemscope\n\titemtype="http://schema.org/WebSite" \n\tprefix="og: http://ogp.me/ns#"
>\n<head>\n\t<meta charset="UTF-8" />\n\t<meta http-equiv="X-UA-Compatible" content="IE'
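A response object carries more than .text: the status code, headers, and raw bytes are also available. The following sketch demonstrates this without assuming network access by spinning up a throwaway local HTTP server and requesting from it; the server, port, and page content are all made up for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests  # third-party: pip install requests

class Handler(BaseHTTPRequestHandler):
    """Serves a tiny fixed HTML page for any GET request."""
    def do_GET(self):
        body = b"<title>hello</title>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass  # silence per-request logging

# Port 0 lets the OS pick a free port.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

r = requests.get(f"http://127.0.0.1:{server.server_port}/")
print(r.status_code)              # 200
print(r.headers["Content-Type"])  # text/html
print(r.text)                     # <title>hello</title>
server.shutdown()
```

The same attributes (r.status_code, r.headers, r.content) work identically when requesting a real site such as the one in the example above.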

Urllib3

It is another Python library that can be used to retrieve data from URLs, similar to the Requests library. You can read more about it in its technical documentation at https://urllib3.readthedocs.io/en/latest/.

Installing Urllib3

Using the pip command, we can install urllib3 in our virtual environment or globally.

(base) D:\ProgramData>pip install urllib3
Collecting urllib3
Using cached
https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c5
3851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl
Installing collected packages: urllib3
Successfully installed urllib3-1.23

Example: Scraping with Urllib3 and BeautifulSoup

In the following example, we scrape a web page by using Urllib3 and BeautifulSoup. Here, Urllib3 takes the place of the Requests library to retrieve the raw data (HTML) from the web page, and BeautifulSoup is then used to parse that HTML data.

import urllib3
from bs4 import BeautifulSoup
http = urllib3.PoolManager()
r = http.request('GET', 'https://authoraditiagarwal.com')
soup = BeautifulSoup(r.data, 'lxml')
print(soup.title)
print(soup.title.text)

This is the output you will observe when you run this code −

<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal
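If BeautifulSoup or lxml happens to be unavailable, the standard library's html.parser can extract the same <title> text. This is a minimal sketch working on a hard-coded HTML string (assumed here so the example needs no network), not on the live page.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = '<html><head><title>Learn and Grow with Aditi Agarwal</title></head></html>'
parser = TitleParser()
parser.feed(html)
print(parser.title)  # Learn and Grow with Aditi Agarwal
```

BeautifulSoup remains the more convenient choice for real pages, since it tolerates the malformed HTML that is common in the wild.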

Selenium

It is an open source automated testing suite for web applications across different browsers and platforms. It is not a single tool but a suite of software. Selenium provides bindings for Python, Java, C#, Ruby, and JavaScript. Here, we will learn web scraping using Selenium and its Python bindings.

The Selenium Python bindings provide a convenient API for accessing Selenium WebDrivers such as Firefox, IE, Chrome, and Remote. Currently supported Python versions are 2.7, 3.5, and above.

Installing Selenium

Using the pip command, we can install Selenium in our virtual environment or globally.

pip install selenium

Since Selenium requires a driver to interface with the selected browser, we need to download it. The following table shows different browsers and their download links.

Chrome     https://sites.google.com/a/chromium.org/
Edge       https://developer.microsoft.com/
Firefox    https://github.com/
Safari     https://webkit.org/

Example

This example shows how to use Selenium to crawl a web page. Selenium can also be used for testing, which is known as Selenium testing.

After downloading the particular driver for the specified browser version, we can start programming in Python.

First, we need to import webdriver from selenium as shown below.

from selenium import webdriver

Now, provide the path of the web driver which we have downloaded as per our requirement.

path = r'C:\Users\gaurav\Desktop\Chromedriver'
browser = webdriver.Chrome(executable_path = path)

Now, provide the URL we want to open in the web browser, which is now controlled by our Python script.

browser.get('https://authoraditiagarwal.com/leadershipmanagement')

We can also select a particular element by providing its XPath, the same path syntax used by lxml.

browser.find_element_by_xpath('/html/body').click()

You can inspect the output of the browser controlled by the Python script.
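XPath expressions such as /html/body can also be experimented with offline: the standard library's xml.etree.ElementTree supports a useful subset of XPath, which is a quick way to sanity-check a path before handing it to Selenium. Below is a small sketch on a hard-coded, well-formed snippet (the markup is assumed for illustration, not fetched from the site).

```python
import xml.etree.ElementTree as ET

# A tiny well-formed document standing in for a real page.
doc = """<html>
  <head><title>Demo</title></head>
  <body><p>Leadership management</p></body>
</html>"""

root = ET.fromstring(doc)

# ElementTree paths are relative to the root element, so 'body'
# (rather than '/html/body') selects the <body> child of <html>.
body = root.find("body")
print(body.find("p").text)  # Leadership management
```

Note that ElementTree requires well-formed XML, so for real-world HTML you would still rely on Selenium, lxml, or BeautifulSoup.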

Scrapy

Scrapy is a fast, open-source web crawling framework written in Python for scraping data from web pages using XPath-based selectors. Scrapy was first released on June 26, 2008, under a BSD license, and reached version 1.0 in June 2015. It provides all the tools you need to crawl, process, and structure data from websites.

Installing Scrapy

Using the pip command, we can install Scrapy in our virtual environment or globally.

pip install scrapy
