Python Web Scraping: Data Extraction
Analyzing a web page means understanding its structure. So why is that important for web scraping? In this chapter, let's take a closer look.
Webpage Analysis
Webpage analysis is important because without analysis, we won’t know what form the data will take after scraping (structured or unstructured). We can analyze a web page by:
View Page Source
This is a way to understand the structure of a web page by examining its source code. To do this, right-click on the web page and select the View Page Source option. The browser then shows the raw HTML of the page, from which we can pick out the data we're interested in. The main drawback is that the raw source pays no attention to whitespace and formatting, which makes it hard for us to read.
Inspecting the Source of a Web Page by Clicking the Inspect Element Option
This is another way to analyze a web page, and it resolves the formatting and whitespace issues of the raw source view. You can access it by right-clicking and selecting the Inspect or Inspect Element option from the menu. It provides information about a particular area or element of the webpage.
Different Methods for Scraping Data from Webpages
The following methods are most commonly used for scraping data from webpages:
Regular Expressions
Regular expressions are a highly specialized, compact pattern language embedded in Python and available through its re module. They are also known as REs, regexes, or regex patterns. With regular expressions, we can specify rules for the set of possible strings we want to match in the data.
If you want to learn more about regular expressions, visit https://www.tutorialspoint.com/automata_theory/regular_expressions.htm. If you want to learn more about the re module or regular expressions in Python, you can click on the link https://www.tutorialspoint.com/python/python_reg_expressions.htm.
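As a minimal, standalone illustration of such a rule (not tied to any particular website), the pattern below matches strings consisting of exactly six digits, similar to the postal-code pattern that appears in the scraped output further down.
import re

# A rule for the set of strings made up of exactly six digits.
pattern = r'^\d{6}$'
print(bool(re.match(pattern, '110001')))   # True
print(bool(re.match(pattern, 'abc123')))   # False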
Example
In the following example, we will scrape data about India from http://example.webscraping.com by matching the contents of its <td> elements with regular expressions.
import re
import urllib.request

# Download the country page for India from the example site.
response = urllib.request.urlopen('http://example.webscraping.com/places/default/view/India-102')
html = response.read()
text = html.decode()

# Extract the contents of every <td class="w2p_fw"> cell.
print(re.findall('<td class="w2p_fw">(.*?)</td>', text))
Output
The corresponding output will be as follows
[
'<img src="/places/static/images/flags/in.png" />',
'3,287,590 square kilometers',
'1,173,108,018',
'IN',
'India',
'New Delhi',
'AS',
'.in',
'INR',
'Rupee',
'91',
'######',
'^(\\d{6})$',
'en-IN,hi,bn,te,mr,ta,ur,gu,kn,ml,or,pa,as,bh,sat,ks,ne,sd,kok,doi,mni,sit,sa,fr,lus,inc',
'<div>CN NP MM BT PK BD </div>'
]
Observe that in the output above, we obtained detailed information about India simply by applying a regular expression to the page source.
BeautifulSoup
Suppose we want to collect all the hyperlinks from a webpage. We can use a parser called BeautifulSoup, documented in detail at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. Simply put, BeautifulSoup is a Python library for pulling data out of HTML and XML files. It is commonly paired with requests, because BeautifulSoup needs a document as input to build a soup object and cannot fetch a webpage on its own. The following Python script collects the title of a webpage; a short sketch for collecting its hyperlinks follows the example.
Installing Beautiful Soup
Using the pip command, we can install BeautifulSoup either in our virtual environment or globally.
(base) D:\ProgramData>pip install bs4
Collecting bs4
Downloading
https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89
a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in d:\programdata\lib\site-packages
(from bs4) (4.6.0)
Building wheels for collected packages: bs4
Running setup.py bdist_wheel for bs4 ... done
Stored in directory:
C:\Users\gaurav\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d
52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
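As a quick sanity check that the installation worked (a minimal sketch, not part of the install transcript above), we can import the library and parse a tiny inline document:
from bs4 import BeautifulSoup

# Parse a small HTML snippet with Python's built-in html.parser backend.
soup = BeautifulSoup('<p>Hello, BeautifulSoup</p>', 'html.parser')
print(soup.p.text)   # prints: Hello, BeautifulSoup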
Example
Note that in this example, we build on the previous example by using the requests Python module. We use r.text to create a soup object, which is then used to fetch details such as the title of the webpage.
First, we need to import the necessary Python modules:
import requests
from bs4 import BeautifulSoup
In the following line of code, we use requests to make a GET HTTP request to the URL: https://authoraditiagarwal.com/.
r = requests.get('https://authoraditiagarwal.com/')
Now we need to create a Soup object, as shown below:
soup = BeautifulSoup(r.text, 'lxml')
print (soup.title)
print (soup.title.text)
Output
The corresponding output will be as shown below.
<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal
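The script above prints only the page title. As mentioned earlier, the same soup object can also be used to collect the hyperlinks of the page; a minimal sketch is shown below (the exact links printed will depend on the live page).
# Collect every hyperlink on the page from the same soup object.
# find_all('a', href=True) returns only anchor tags that carry an href attribute.
for link in soup.find_all('a', href=True):
   print(link['href'])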
Lxml
Another web scraping tool we'll discuss is lxml, a high-performance HTML and XML parsing library for Python (https://lxml.de/).
Installing lxml
Using pip, we can install lxml in our virtual environment or globally.
(base) D:\ProgramData>pip install lxml
Collecting lxml
Downloading
https://files.pythonhosted.org/packages/b9/55/bcc78c70e8ba30f51b5495eb0e
3e949aa06e4a2de55b3de53dc9fa9653fa/lxml-4.2.5-cp36-cp36m-win_amd64.whl
(3.6MB)
100% |████████████████████████████████| 3.6MB 64kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.5
Example: Scraping Data Using lxml and requests
In the following example, we use lxml and requests to scrape a specific element from the authoraditiagarwal.com webpage.
First, we need to import requests and html from the lxml library, as shown below.
import requests
from lxml import html
Now, we need to provide the URL of the webpage to be scraped.
url = 'https://authoraditiagarwal.com/leadershipmanagement/'
Now, we need to provide the XPath of the specific element on that webpage:
path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
# Download the page and parse the raw bytes into an lxml HTML tree.
response = requests.get(url)
byte_string = response.content
source_code = html.fromstring(byte_string)

# Evaluate the XPath expression; the result is a list of matching elements.
tree = source_code.xpath(path)
print(tree[0].text_content())
Output
The corresponding output will look like this
The Sprint Burndown or the Iteration Burndown chart is a powerful tool to communicate
daily progress to the stakeholders. It tracks the completion of work for a given sprint
or an iteration. The horizontal axis represents the days within a Sprint. The vertical
axis represents the hours remaining to complete the committed work.
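Note that source_code.xpath(path) returns a list of matching elements, so a slightly more defensive version of the last two lines of the example (a sketch, reusing the variables defined above) would check that something was actually found before indexing:
# Defensive variant: make sure the XPath matched before indexing the result.
matches = source_code.xpath(path)
if matches:
   print(matches[0].text_content())
else:
   print('Element not found - the page layout may have changed.')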