
Python Web Scraping and Data Processing

In previous chapters, we learned about scraping data from the web, or web scraping, using various Python modules. In this chapter, let’s look at various techniques for processing scraped data.

Introduction

To process scraped data, we must first store it locally in a suitable format, such as a CSV spreadsheet or a JSON file, or sometimes in a database such as MySQL.

CSV and JSON Data Processing

First, we need to write the information we scrape from the web into a CSV file or spreadsheet. Let’s start with a simple example. In this example, we’ll first use the BeautifulSoup module to scrape the information, as we did previously. Then, using the Python CSV module, we’ll write this text information into a CSV file.

First, we need to import the necessary Python libraries as shown below.

import requests
from bs4 import BeautifulSoup
import csv

In the following line of code, we use requests to make an HTTP GET request to the URL https://authoraditiagarwal.com/.

r = requests.get('https://authoraditiagarwal.com/')

Now, we need to create a Soup object as shown below:

soup = BeautifulSoup(r.text, 'lxml')

Now, with the help of the next few lines of code, we will write the scraped data to a CSV file called dataprocessing.csv.

f = csv.writer(open('dataprocessing.csv', 'w', newline=''))
f.writerow(['Title'])
f.writerow([soup.title.text])

After running this script, the text information or webpage title will be saved in the above CSV file on your local machine.
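To confirm that the write succeeded, we can read the file back with the same csv module. The following is a minimal check, assuming the script above has already created dataprocessing.csv in the current directory.

import csv

# Read the CSV file back and print each row to verify the title was written.
with open('dataprocessing.csv', 'r', newline='') as infile:
    for row in csv.reader(infile):
        print(row)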

Similarly, we can also save the collected information in a JSON file. Below is a simple Python script that scrapes the same information as the previous script, but this time the scraped information is saved in JSONFile.txt using Python’s json module.

import requests
from bs4 import BeautifulSoup
import json

r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')
with open('JSONFile.txt', 'wt') as outfile:
    json.dump(soup.title.text, outfile)

After running this script, the scraped information, namely the webpage title, will be saved in the above text file on your local machine.
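As a quick check, the saved value can be loaded back with json.load. The following is a minimal sketch, assuming the script above has already created JSONFile.txt.

import json

# Load the saved title back from the JSON file and print it.
with open('JSONFile.txt', 'rt') as infile:
    title = json.load(infile)
print(title)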

Data Processing with AWS S3

Sometimes, we might want to store the crawled data locally for archiving. But what if we need to store and analyze this data at scale? The answer is a cloud storage service called Amazon S3, or AWS S3 (Simple Storage Service). Basically, AWS S3 is an object storage service that’s used to store and retrieve data from anywhere.

We can store data in AWS S3 by following these steps –

Step 1 – First, we’ll need an AWS account, which provides the access key ID and secret access key our Python script will use when storing data. With this account we can also create an S3 bucket in which to store our data.
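boto3 looks for these credentials in the standard AWS configuration files (~/.aws/credentials) or in environment variables, but they can also be passed explicitly when creating the client. The sketch below is only an illustration; the key values and region shown are placeholders, not real credentials.

import boto3

# Placeholder credentials -- replace with your own access key ID and secret
# access key, or omit these arguments and let boto3 read ~/.aws/credentials.
s3 = boto3.client(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
    region_name='us-east-1'
)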

Step 2 – Next, we’ll need to install the boto3 Python library to access the S3 bucket. This can be installed with the help of the following command:

pip install boto3

Step 3 – Next, we can use the following Python script to scrape data from a webpage and save it to an AWS S3 bucket.

First, we need to import the Python libraries for scraping. Here, we’ll be working with requests and boto3 for saving the data to an S3 bucket.

import requests
import boto3

Now we can scrape the data from our URL. Replace the placeholder string below with the address of the page you want to scrape.

data = requests.get("Enter the URL").text

Now, to store the data in the S3 bucket, we need to create the S3 client as shown below.

s3 = boto3.client('s3')
bucket_name = "our-content"

The next lines of code create the S3 bucket and then upload the scraped data to it as an object.

s3.create_bucket(Bucket=bucket_name, ACL='public-read')
# The object key is the file name inside the bucket; any name can be used here.
s3.put_object(Bucket=bucket_name, Key='scraped_page.html', Body=data, ACL='public-read')

Now you can check the bucket named “our-content” in your AWS account.
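To check the upload from Python rather than the AWS console, we can list the objects in the bucket. This is a minimal sketch that reuses the s3 client and bucket_name from the script above.

# List the objects stored in the bucket to confirm the upload succeeded.
response = s3.list_objects_v2(Bucket=bucket_name)
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])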

Processing Data with MySQL

Let’s learn how to use MySQL to process scraped data. If you have not used MySQL before, a dedicated MySQL tutorial is a good starting point.

With the help of the following steps, we can scrape and process the data into a MySQL table.

Step 1 – First, using MySQL, we need to create a database (named scrap here) and a table in which we will store our scraped data. For example, let’s create the table using the following query:

CREATE TABLE Scrap_pages (id BIGINT(7) NOT NULL AUTO_INCREMENT,
   title VARCHAR(200), content VARCHAR(10000), PRIMARY KEY(id));

Step 2 – Next, we need to handle Unicode. Note that MySQL does not handle Unicode by default. We need to enable this feature with the help of the following command, which will change the default character set for the database, table, and two columns.

ALTER DATABASE scrap CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE title title VARCHAR(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE content content VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Step 3 – Now, let’s integrate MySQL with Python. To do this, we’ll need PyMySQL, which can be installed with the help of the following command:

pip install PyMySQL

Step 4 – Now, the database we created earlier, named scrap, is ready to store data scraped from the web in a table called Scrap_pages. In our example, we’ll scrape data from Wikipedia and store it in our database.

First, we need to import the required Python modules.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re

Now, let’s connect to MySQL from Python.

conn = pymysql.connect(host='127.0.0.1', user='root', passwd=None,
                       db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("USE scrap")
random.seed(datetime.datetime.now().timestamp())

def store(title, content):
    cur.execute('INSERT INTO Scrap_pages (title, content) VALUES (%s, %s)',
                (title, content))
    cur.connection.commit()

Now, connect to Wikipedia and retrieve data from it.

def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org' + articleUrl)
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.find('h1').get_text()
    content = bs.find('div', {'id': 'mw-content-text'}).find('p').get_text()
    store(title, content)
    return bs.find('div', {'id': 'bodyContent'}).findAll('a',
        href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs['href']
        print(newArticle)
        links = getLinks(newArticle)

Finally, we need to close both the cursor and the connection.

finally:
    cur.close()
    conn.close()

This will save the data scraped from Wikipedia to the table called Scrap_pages. If you’re familiar with MySQL and web scraping, the above code should be easy to understand.
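To confirm that rows are actually being written, the table can be queried from Python as well. The following is a minimal sketch; run it before the finally block closes the cursor, or open a new connection in the same way as above.

# Count the stored pages and show the most recently inserted titles.
cur.execute("SELECT COUNT(*) FROM Scrap_pages")
print("Pages stored:", cur.fetchone()[0])
cur.execute("SELECT id, title FROM Scrap_pages ORDER BY id DESC LIMIT 5")
for row in cur.fetchall():
    print(row)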

Data Processing with PostgreSQL

PostgreSQL is an open-source relational database management system (RDBMS) developed by a global team of volunteers. Processing scraped data with PostgreSQL is similar to doing so with MySQL, with two differences. First, the commands differ from MySQL’s, and second, we’ll use the psycopg2 Python library for the Python integration. If you are not familiar with PostgreSQL, you can learn about it at https://www.tutorialspoint.com/postgresql/postgresql-top-tutorials/1000100_postgresql_index.html/. We can install the psycopg2 Python library with the help of the following command:

pip install psycopg2
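Since the paragraph above only names the library, here is a minimal sketch of storing a scraped page title in PostgreSQL with psycopg2. The connection parameters (database name scrapdb, user, password) and the table name scrap_pages used here are placeholder assumptions for illustration, not values defined earlier in this chapter.

import requests
from bs4 import BeautifulSoup
import psycopg2

# Scrape the page title, as in the earlier examples.
r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')
title = soup.title.text

# Connection parameters are placeholders -- adjust them for your own setup.
conn = psycopg2.connect(host='127.0.0.1', dbname='scrapdb',
                        user='postgres', password='your_password')
cur = conn.cursor()

# Create a simple table if needed and insert the title with a parameterized query.
cur.execute("""CREATE TABLE IF NOT EXISTS scrap_pages (
                   id SERIAL PRIMARY KEY,
                   title VARCHAR(200))""")
cur.execute("INSERT INTO scrap_pages (title) VALUES (%s)", (title,))
conn.commit()

cur.close()
conn.close()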
