Python Web Scraping: Form-Based Websites

In the previous chapter, we looked at scraping dynamic websites. In this chapter, let’s learn about scraping websites that rely on user input, that is, form-based websites.

Introduction

Today, the WWW (World Wide Web) is evolving towards social media and user-generated content. Therefore, the question arises: how can we access this information beyond the login screen? To do this, we need to deal with forms and logins.

In previous chapters, we used the HTTP GET method to request information, but in this chapter, we will use the HTTP POST method to push information to a web server for storage and analysis.
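To see what a POST request actually carries, the sketch below builds the same kind of form body that requests sends, using only the standard library's `urllib.parse.urlencode`. The field names mirror the examples later in this chapter; the values are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical form fields, matching the login-form examples in this chapter
parameters = {'Name': 'James', 'Email-id': 'james@example.com', 'Message': 'Hello'}

# requests.post(url, data=parameters) sends this string as the request body,
# with the header Content-Type: application/x-www-form-urlencoded
body = urlencode(parameters)
print(body)  # Name=James&Email-id=james%40example.com&Message=Hello
```

Note how reserved characters such as `@` are percent-encoded; the server decodes them back into the original field values.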

Interacting with Login Forms

While working on the internet, you’ve undoubtedly interacted with login forms many times. They can be very simple, consisting of just a few HTML fields, a submit button, and an action page, or they can be complex, with additional fields like an email address, a message, and a CAPTCHA for security reasons.

In this section, we’ll handle a simple submission form with the help of the Python requests library.

First, we need to import the requests library as shown below.

import requests

Now, we need to provide information for the fields of the login form.

parameters = {'Name': 'Enter your name', 'Email-id': 'Your Email ID', 'Message': 'Type your message here'}

In the next line of code, we need to provide a URL where the form action will occur.

r = requests.post("enter the URL", data=parameters)
print(r.text)

After running this script, it will return the content of the page where the action occurred.

Suppose you want to upload an image through a form. This is very easy to do with requests.post(), as the following Python script shows.

import requests
file = {'Uploadfile': open(r'C:\Users\desktop\123.png', 'rb')}
r = requests.post("enter the URL", files=file)
print(r.text)

Loading Cookies from a Web Server

A cookie, sometimes also called a web cookie or internet cookie, is a small piece of data sent by a website that our computer stores in a file located within our web browser.

In the context of login forms, cookies serve two purposes. The first, covered in the previous section, is to let us submit information to a website. The second is to keep us "logged in" for the duration of our visit: websites use cookies to track who is logged in and who is not.

What Cookies Do

These days, most websites use cookies for tracking purposes. We can understand how cookies work with the help of the following steps.

Step 1 – First, the website will verify our login credentials and store them in a cookie on our browser. This cookie typically contains a server-generated token, timeout, and tracking information.

Step 2 – Next, the website will use the cookie as proof of authentication, and our browser presents it with every subsequent request to the site.
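The token and timeout mentioned in Step 1 live inside the Set-Cookie header the server sends back. As a sketch, the standard library's `http.cookies.SimpleCookie` can parse such a header; the header string below is a made-up example of what a login page might return.

```python
from http.cookies import SimpleCookie

# A hypothetical Set-Cookie header such as a login page might return
header = "sessiontoken=9f8e7d6c; Max-Age=3600; Path=/"

cookie = SimpleCookie()
cookie.load(header)  # parse the header into named cookie morsels

print(cookie["sessiontoken"].value)       # the server-generated token: 9f8e7d6c
print(cookie["sessiontoken"]["max-age"])  # the timeout in seconds: 3600
```

The same structure is what requests exposes through `r.cookies` in the scripts below.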

Cookies can be very problematic for web scrapers: if a scraper does not send the tracking cookies back, the website treats each submission as coming from a visitor who never logged in. Tracking cookies is easy with the Python requests library, as shown below.

import requests
parameters = {'Name': 'Enter your name', 'Email-id': 'Your Email ID', 'Message': 'Type your message here'}
r = requests.post("enter the URL", data=parameters)

In the above code, the URL will be the page that handles the login form.

print('The cookie is:')
print(r.cookies.get_dict())
print(r.text)

After running the above script, we will retrieve the cookies from the results of the last request.

Another issue with cookies is that websites can modify them without warning. This can be handled using requests.Session(), as shown below.

import requests
session = requests.Session()
parameters = {'Name': 'Enter your name', 'Email-id': 'Your Email ID', 'Message': 'Type your message here'}
r = session.post("enter the URL", data=parameters)

In the above code, the URL will be the page that handles the login form.

print('The cookie is:')
print(r.cookies.get_dict())
print(r.text)

Comparing the two scripts, the difference is that a Session object automatically stores the cookies it receives and sends them back on every subsequent request, so updated cookies are always honored.
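The cookie-persistence behavior that a session provides can be demonstrated end to end without requests, using only the standard library. The sketch below runs a tiny local mock server (an assumption for illustration; the cookie name and value are made up) that sets a session cookie on the first visit, and a cookie-aware urllib opener that plays the role of the session, sending the cookie back automatically on the second visit.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class CookieHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        if "Cookie" not in self.headers:
            # First visit: issue a hypothetical session cookie
            self.send_header("Set-Cookie", "sessionid=abc123")
            body = b"no cookie yet"
        else:
            # Later visits: echo back whatever cookie the client sent
            body = self.headers["Cookie"].encode()
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), CookieHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/"

# A cookie jar plus an opener behaves like requests.Session() for cookies
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

first = opener.open(url).read().decode()   # server sets the cookie here
second = opener.open(url).read().decode()  # jar sends it back automatically

print(first)   # no cookie yet
print(second)  # sessionid=abc123
server.shutdown()
```

A plain one-off request, like the first script in this section, would behave like `first` on every visit; only the jar (or session) makes the second request look "logged in".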

Form Automation with Python

In this section, we will discuss a Python module called Mechanize, which will reduce our workload and automate the process of filling out forms.

Mechanize Module

The Mechanize module provides a high-level interface for interacting with forms. Before we begin using it, we need to install it using the following command:

pip install mechanize

Note that Mechanize originally ran only on Python 2.x; newer releases of the module also support Python 3.

Example

In this example, we’ll automatically fill out a login form with two fields: email and password.

import mechanize
brwsr = mechanize.Browser()
brwsr.open("Enter the login URL")
brwsr.select_form(nr=0)
brwsr['email'] = 'Enter email'
brwsr['password'] = 'Enter password'
response = brwsr.submit()

The code above is pretty self-explanatory. First, we import the mechanize module and create a Mechanize browser object. Then, we navigate to the login URL and select the first form on the page. Finally, the field names and values are assigned directly on the browser object, and the form is submitted.
