Python web crawler processing images and videos

Python Web Scraping: Image and Video Processing

Web scraping typically involves downloading, storing, and processing web media content. In this chapter, let’s learn how to process content downloaded from the web.

Introduction

The media content we obtain while scraping can be image, audio, or video files, other non-webpage formats, or data files. But can we trust what we download, particularly given how much data we may end up storing on our computer? That makes it crucial to know what type of data we are about to store locally.

Getting Media Content from Web Pages

In this section, we’ll learn how to download media content that correctly represents the media type, based on information from the web server. We can do this with the help of the Python requests module, just as we did in the previous chapter.

First, we need to import the necessary Python modules as shown below.

import requests

Now, provide the URL of the media content we want to download and store locally.

url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"

Use the following code to create an HTTP response object.

r = requests.get(url)

The following lines save the received content to a file named ThinkBig.png. Note that the bytes are written exactly as received, so the file still holds JPEG data despite the .png extension.

with open("ThinkBig.png", 'wb') as f:
    f.write(r.content)

After running the above Python script, we will get a file named ThinkBig.png, which contains the downloaded image.
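Since the goal of this section is a file that correctly represents the media type reported by the web server, the following is a minimal sketch of choosing the file extension from the Content-Type header instead of hard-coding it. The names extension_for, save_response, and the EXTENSIONS table are hypothetical helpers introduced here for illustration, and the table covers only a handful of common types.

```python
# Hypothetical lookup table: a small illustrative subset of media types,
# not a complete registry.
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "video/mp4": ".mp4",
    "audio/mpeg": ".mp3",
}


def extension_for(content_type, default=".bin"):
    """Return a file extension for a Content-Type header value."""
    if not content_type:
        return default
    # Drop parameters such as "; charset=binary" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return EXTENSIONS.get(media_type, default)


def save_response(response, stem):
    """Write response.content to '<stem><extension>' and return the filename.

    `response` is assumed to be a requests.Response; any object with
    .headers and .content attributes works the same way.
    """
    filename = stem + extension_for(response.headers.get("content-type"))
    with open(filename, "wb") as f:
        f.write(response.content)
    return filename
```

With the response object r created earlier, save_response(r, "ThinkBig") would write ThinkBig.jpg whenever the server reports image/jpeg.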

Extracting the Filename from a URL

After downloading content from a website, we usually want to save it under the filename found in the URL. However, the URL contains other components besides the filename, such as the host and the directory path, so we need to extract the actual filename from it.

With the help of the following Python script, using urlparse, we can extract the filename from a URL.

from urllib.parse import urlparse
import os
url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
a = urlparse(url)
a.path

You can observe the output, as shown below.

'/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg'
os.path.basename(a.path)

You can observe the output, as shown below.

'MetaSlider_ThinkBig-1080x180.jpg'

Once you run the above script, we will get the file name from the URL.
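Real-world URLs often carry a query string, a fragment, or percent-encoded characters, and some have no filename at all. The following is a small sketch of a helper that handles those cases; filename_from_url and its default value "download" are hypothetical names chosen for this example.

```python
import posixpath
from urllib.parse import urlparse, unquote


def filename_from_url(url, default="download"):
    """Return the last path segment of a URL, or `default` if there is none."""
    # urlparse() separates the path from the query string and fragment.
    path = urlparse(url).path
    # Decode percent-escapes first (e.g. %20), then keep only the final
    # segment so that any directory components are discarded.
    name = posixpath.basename(unquote(path))
    return name or default


print(filename_from_url(
    "https://authoraditiagarwal.com/wp-content/uploads/2018/05/"
    "MetaSlider_ThinkBig-1080x180.jpg"
))
```

URL paths always use forward slashes, which is why this sketch uses posixpath rather than os.path.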

Getting Content-Type Information from a URL

While fetching content from a web server with a GET request, we can also examine the information the server sends back in its response headers. The following Python script shows what the web server reports, including the Content-Type:

First, we need to import the necessary Python modules as shown below.

import requests

Now, we need to provide the URL of the media content we want to download and store locally.

url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"

The following line of code creates the HTTP response object.

r = requests.get(url, allow_redirects=True)

Now, we can list the names of the headers that the web server returned.

for header in r.headers:
    print(header)

You can observe the output, as shown below.

Date
Server
Upgrade
Connection
Last-Modified
Accept-Ranges
Content-Length
Keep-Alive
Content-Type

With the help of the following line of code, we can get the value of a specific header, such as the content type:

print (r.headers.get('content-type'))

You can observe the output, as shown in the figure below.

image/jpeg

Similarly, the following line of code gets the value of the ETag header (the result is None if the server did not send one):

print (r.headers.get('ETag'))

You can observe the output, as shown in the figure below.

None

The following line prints the content length in bytes:

print (r.headers.get('content-length'))

You can observe the output, as shown in the following figure.

12636

The following line of code gets the value of the Server header:

print (r.headers.get('Server'))

You can observe the output, as shown in the following figure.

Apache
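These headers can help answer the question raised in the introduction about whether to trust and store a download. The following is a rough sketch, under the assumption that we only want media files below a size limit; check_headers, its allowed prefixes, and the 50 MB default are hypothetical choices for this example, and a server can misreport or omit any of these headers.

```python
def check_headers(headers, allowed=("image/", "audio/", "video/"),
                  max_bytes=50 * 1024 * 1024):
    """Return (ok, reason) based on Content-Type and Content-Length."""
    # Lower-case the keys so a plain dict works as well as r.headers.
    lowered = {key.lower(): value for key, value in headers.items()}

    # Strip parameters such as "; charset=..." from the media type.
    content_type = lowered.get("content-type", "")
    media_type = content_type.split(";", 1)[0].strip().lower()
    if not media_type.startswith(tuple(allowed)):
        return False, f"unexpected content type: {media_type or 'missing'}"

    # Content-Length is optional; only enforce the limit when it is present.
    length = lowered.get("content-length")
    if length is not None and length.isdigit() and int(length) > max_bytes:
        return False, f"too large: {length} bytes"

    return True, "ok"


print(check_headers({"Content-Type": "image/jpeg", "Content-Length": "12636"}))
```

In a script, the headers could come from r.headers as above, or from requests.head(url, allow_redirects=True), which asks for the headers without downloading the body.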

Generating a Thumbnail of an Image

A thumbnail is a small, reduced-size version of an image. Users may want to save only a thumbnail of a larger image, or both the image and the thumbnail. In this section, we’ll create a thumbnail for the image named ThinkBig.png that we downloaded in the earlier section, “Getting Media Content from Web Pages.”

For this Python script, we’ll need to install the Python library called Pillow, a fork of the Python Imaging Library that has useful functions for working with images. It can be installed with the following command:

pip install pillow

The following Python script will create a thumbnail of an image and save it in the current directory. The thumbnail file name will be prefixed with Th_.

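import glob
from PIL import Image

for infile in glob.glob("ThinkBig.png"):
    # Skip files that are already thumbnails.
    if infile.startswith("Th_"):
        continue
    img = Image.open(infile)
    # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter.
    img.thumbnail((128, 128), Image.LANCZOS)
    # The format is given explicitly, so the thumbnail is real PNG data.
    img.save("Th_" + infile, "png")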

The above code is pretty self-explanatory. You can find the thumbnail files in the current directory.
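Note that (128, 128) is a bounding box, not the final size: thumbnail() keeps the aspect ratio and never enlarges an image. The following sketch works through that arithmetic for the 1080x180 image used in this chapter. The function thumbnail_size is a hypothetical helper written for this illustration, and Pillow's own rounding may differ by a pixel in some cases.

```python
def thumbnail_size(width, height, max_width, max_height):
    """Approximate the size of an image after fitting it into a bounding box."""
    # Use the smaller of the two scale factors so both sides fit,
    # and cap it at 1.0 so small images are never enlarged.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


# The banner image from this chapter is 1080 pixels wide and 180 high.
print(thumbnail_size(1080, 180, 128, 128))  # (128, 21)
```

The width is the limiting side here, so the result is a wide, short thumbnail rather than a square.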

Taking Screenshots from a Website

Taking screenshots of websites is a very common task in web scraping. To achieve this, we’ll use Selenium and WebDriver. The following Python script will take screenshots from a website and save them to the current directory.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Path to the ChromeDriver executable; adjust it for your system.
path = r'C:\Users\gaurav\Desktop\Chromedriver'
# Selenium 4 takes the driver path through a Service object
# (older releases used webdriver.Chrome(executable_path=path)).
browser = webdriver.Chrome(service=Service(path))
browser.get('https://tutorialspoint.com/')
browser.save_screenshot('screenshot.png')
# Call quit() with parentheses so the browser actually closes.
browser.quit()

You can observe the output, as shown below.

DevTools listening on ws://127.0.0.1:1456/devtools/browser/488ed704-9f1b-44f0-a571-892dc4c90eb7

After running this script, you can check if you have a screenshot.png file in your current directory.

Processing Images and Videos

Video Thumbnail Generation

Suppose we have downloaded videos from a website and want to generate thumbnails for them so that we can click on a specific video based on the thumbnail. To generate thumbnails for our videos, we need a simple tool called ffmpeg which can be downloaded from www.ffmpeg.org. Once downloaded, we need to install it according to the specifications of our operating system.

The following Python script will generate a thumbnail of the video and save it to our local directory.

import subprocess

video_MP4_file = r"C:\Users\gaurav\desktop\solar.mp4"
thumbnail_image_file = 'thumbnail_solar_video.jpg'
# Grab one frame at the 20-second mark; -y overwrites an existing output file.
subprocess.call(['ffmpeg', '-y', '-i', video_MP4_file, '-ss', '00:00:20.000',
                 '-vframes', '1', thumbnail_image_file])

After running the above script, we will get a file named thumbnail_solar_video.jpg saved in our local directory.
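When thumbnails are needed at different positions or for several videos, it can help to build the ffmpeg command from parameters. The following is a sketch along those lines; format_timestamp, thumbnail_command, and make_thumbnail are hypothetical helper names, and the sketch assumes that the ffmpeg executable is available on the PATH.

```python
import shutil
import subprocess


def format_timestamp(seconds):
    """Convert seconds to the HH:MM:SS.mmm form used with ffmpeg's -ss option."""
    # Work in whole milliseconds to avoid floating-point rounding surprises.
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def thumbnail_command(video_file, image_file, seconds=20.0):
    """Build the ffmpeg argument list that extracts a single frame."""
    return ["ffmpeg", "-y", "-i", video_file,
            "-ss", format_timestamp(seconds),
            "-vframes", "1", image_file]


def make_thumbnail(video_file, image_file, seconds=20.0):
    """Run ffmpeg and raise an error if it is missing or fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    # check=True raises CalledProcessError when ffmpeg exits with an error.
    subprocess.run(thumbnail_command(video_file, image_file, seconds),
                   check=True)


print(thumbnail_command("solar.mp4", "thumbnail_solar_video.jpg"))
```

Passing the arguments as a list, rather than as one shell string, means file names with spaces need no extra quoting.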

Rip MP4 Videos to MP3

Suppose you’ve downloaded some video files from a website, but you only need the audio for your purpose. This can be accomplished in Python with the help of a Python library called moviepy, which can be installed with the help of the following command.

pip install moviepy

Now, after successfully installing moviepy, we can convert MP4 to MP3 with the help of the following script.

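# MoviePy 1.x import; in MoviePy 2.x use: from moviepy import VideoFileClip
import moviepy.editor as mp

clip = mp.VideoFileClip(r"C:\Users\gaurav\Desktop\1234.mp4")
# Write only the audio track of the clip to an MP3 file.
clip.audio.write_audiofile("movie_audio.mp3")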

You can observe the output as shown below.

[MoviePy] Writing audio in movie_audio.mp3
100%|██████████| 674/674 [00:01<00:00, 476.30it/s]
[MoviePy] Done.

The above script will save the audio MP3 file to a local directory.
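If several videos have been downloaded, we may want one MP3 file per video. The following is a small sketch that pairs each .mp4 file in a directory with an output name; audio_targets and the directory names are hypothetical, and the MoviePy calls are shown only as comments because they assume MoviePy 1.x is installed.

```python
from pathlib import Path


def audio_targets(video_dir, out_dir="."):
    """Yield (video_path, mp3_path) pairs for every .mp4 file in video_dir."""
    out = Path(out_dir)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        # Keep the video's base name and swap the extension for .mp3.
        yield video, out / (video.stem + ".mp3")


# Possible use with MoviePy 1.x, mirroring the script above:
#
# import moviepy.editor as mp
# for video, mp3 in audio_targets("videos"):
#     clip = mp.VideoFileClip(str(video))
#     if clip.audio is not None:   # clip.audio is None without an audio track
#         clip.audio.write_audiofile(str(mp3))
#     clip.close()
```

Sorting the file list keeps the processing order predictable from one run to the next.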
