Web Scraping with Python: Dynamic Websites
In this chapter, we will learn how to scrape dynamic websites and look at the concepts involved in detail.
Introduction
Web scraping is a complex task, and the complexity multiplies if the website is dynamic. According to the United Nations Global Audit of Web Accessibility, more than 70% of websites are dynamic in nature, and they rely on JavaScript for their functionality.
Dynamic Website Example
Let’s look at an example of a dynamic website and see why it is difficult to scrape. Here we will take the search page at http://example.webscraping.com/places/default/search as our example. But how can we tell that this website is dynamic? We can judge from the output of the following Python script, which tries to scrape data from the above-mentioned webpage:
import re
import urllib.request

# Fetch the raw HTML of the search page and decode it to text.
response = urllib.request.urlopen('http://example.webscraping.com/places/default/search')
html = response.read()
text = html.decode()

# Look for the contents of the results container.
re.findall('<div id="results">(.*?)</div>', text)
Output
[]
The output above shows that the scraper failed to extract any information: the results div is present in the page source, but it is empty, because its contents are filled in by JavaScript after the page loads.
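To confirm this, we can check that the element itself exists in the raw markup even though it has no contents. A minimal sketch, assuming the container is <div id="results"> (the same id the regex above and the CSS selector later in this chapter use):

import urllib.request

# Fetch the static HTML; the results container is present in the markup,
# but its rows are injected by JavaScript, so a plain HTTP fetch sees it empty.
html = urllib.request.urlopen(
    'http://example.webscraping.com/places/default/search').read().decode()
print('<div id="results"' in html)   # expected: True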
How to Scrape Data from Dynamic Websites
We’ve seen that scrapers can’t scrape information from dynamic websites because the data is loaded dynamically using JavaScript. In this case, we can use the following two techniques to scrape data from dynamic websites that rely on JavaScript:
- Reverse Engineering JavaScript
- Rendering JavaScript
Reverse Engineering JavaScript
A process known as reverse engineering can be useful in understanding how data is dynamically loaded by a web page.
To do this, we open the browser’s Inspect Element tool for the URL in question. Next, we click the NETWORK tab and find the requests made for the page, including search.json with a path of /ajax. Instead of accessing this AJAX data from the browser via the NETWORK tab, we can also fetch it with the following Python script:
import requests

# Call the AJAX endpoint directly and parse the JSON response.
response = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')
response.json()
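Before writing the full crawler, it can help to inspect the shape of the JSON payload. A small sketch (the records and num_pages keys are the ones the full script below relies on), which also shows that parsing the raw text with json.loads is equivalent to calling response.json():

import json
import requests

response = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')

# Parsing the raw text ourselves gives the same result as response.json().
data = json.loads(response.text)
print(data['num_pages'])     # how many pages this search spans
print(data['records'][0])    # the first matching record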
Example
The above script allows us to access the JSON response using the response’s json() method. Similarly, we can download the raw string response and load it ourselves with Python’s json.loads method. Building on this, the following Python script scrapes all of the countries by searching for each letter of the alphabet in turn and then iterating over the resulting pages of JSON responses.
import requests
import string

PAGE_SIZE = 15
url = 'http://example.webscraping.com/ajax/' + 'search.json?page={}&page_size={}&search_term={}'

countries = set()
for letter in string.ascii_lowercase:
    print('Searching with %s' % letter)
    page = 0
    while True:
        response = requests.get(url.format(page, PAGE_SIZE, letter))
        data = response.json()
        print('adding %d records from the page %d' % (len(data.get('records')), page))
        # Collect the country name from every record on this page.
        for record in data.get('records'):
            countries.add(record['country'])
        page += 1
        if page >= data['num_pages']:
            break

# Persist the sorted, de-duplicated country names to disk.
with open('countries.txt', 'w') as countries_file:
    countries_file.write('\n'.join(sorted(countries)))
After running the above script, we will get the following output. The records will be saved in a file called countries.txt.
Output
Searching with a
adding 15 records from the page 0
adding 15 records from the page 1
...
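As a quick sanity check (a hypothetical follow-up step, not part of the original script), the saved file can be read back to count the countries collected:

# Read the saved country list back from countries.txt.
with open('countries.txt') as countries_file:
    countries = countries_file.read().splitlines()
print('%d unique countries saved' % len(countries))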
Rendering JavaScript
In the previous section, we reverse engineered a webpage and learned how the API works and how we can use it to retrieve the results of a single request. However, when reverse engineering, we may face the following difficulties.
- Sometimes websites can be very difficult to understand. For example, if a website is built with an advanced browser tool such as Google Web Toolkit (GWT), the resulting JavaScript code is machine-generated and difficult to understand and reverse engineer.
- Some more advanced frameworks, such as React.js, make reverse engineering difficult by abstracting already complex JavaScript logic.
A solution to this difficulty is to use the browser rendering engine to parse HTML, apply CSS formatting, and execute JavaScript to display the web page.
Example
In this example, to render JavaScript we will use the familiar Python module Selenium. The following Python code renders a webpage with Selenium’s help −
First, we need to import webdriver from Selenium as shown below
from selenium import webdriver
Now, as per our requirement, provide the path to the web driver we have downloaded
path = r'C:\Users\gaurav\Desktop\Chromedriver'
driver = webdriver.Chrome(executable_path = path)
Now, provide the URL we want to open in the web browser, which is now controlled by our Python script.
driver.get('http://example.webscraping.com/search')
Now, we can use the ID of the search box to select the element and enter a search term.
driver.find_element_by_id('search_term').send_keys('.')
Next, we can use JavaScript to set the contents of the select box as shown below −
js = "document.getElementById('page_size').options[1].text = '100';"
driver.execute_script(js)
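Note that this snippet only rewrites the visible label of the second option in the page_size select box; depending on how the site reads the submitted form, it may also be necessary to change the option’s value attribute, or to select the option outright, for the larger page size to take effect.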
The following line of code clicks the search button on the web page −
driver.find_element_by_id('search').click()
The next line of code sets an implicit wait of 45 seconds, so that subsequent element lookups give the AJAX request time to complete.
driver.implicitly_wait(45)
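As an alternative to a blanket implicit wait, an explicit wait on the results container is often more precise. A small sketch using Selenium’s WebDriverWait (an addition to the original steps; it reuses the #results id from the selector below):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until at least one result link is present, or 45 seconds pass.
WebDriverWait(driver, 45).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#results a')))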
Now, to select the country links, we can use CSS selectors as shown below
links = driver.find_elements_by_css_selector('#results a')
Now, the text of each link can be extracted to create a list of countries −
countries = [link.text for link in links]
print(countries)
driver.close()
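Putting the steps together, here is a consolidated sketch of the whole rendering approach. Note that the find_element_by_* helpers used above come from Selenium 3; Selenium 4 replaces them with find_element(By..., ...), which this sketch uses (Selenium 4.6+ can also locate chromedriver on its own, so no explicit path is needed):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()   # Selenium 4.6+ resolves the driver binary itself
driver.get('http://example.webscraping.com/search')

# Fill the search box and relabel the page-size option via JavaScript.
driver.find_element(By.ID, 'search_term').send_keys('.')
driver.execute_script("document.getElementById('page_size').options[1].text = '100';")

# Submit the search and allow the AJAX request time to complete.
driver.find_element(By.ID, 'search').click()
driver.implicitly_wait(45)

# Extract the country names from the rendered results.
links = driver.find_elements(By.CSS_SELECTOR, '#results a')
countries = [link.text for link in links]
print(countries)

driver.close()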