Web Scraping with Python: Dynamic Websites
In this chapter, we will learn how to scrape dynamic websites and look at the concepts involved in detail.
Introduction
Web scraping is a complex task, and the complexity multiplies if the website is dynamic. According to the United Nations Global Audit of Web Accessibility, more than 70% of websites are dynamic in nature, and they rely on JavaScript for their functionality.
Dynamic Website Example
Let’s look at an example of a dynamic website and see why it is difficult to scrape. Here we will take the search page at http://example.webscraping.com/places/default/search as our example. But how can we tell that this website is dynamic? We can judge from the output of the following Python script, which tries to scrape data from the above-mentioned webpage:
import re
import urllib.request

# Fetch the raw HTML of the search page and decode it to text.
response = urllib.request.urlopen('http://example.webscraping.com/places/default/search')
html = response.read()
text = html.decode()

# Look for the contents of the results container.
re.findall('<div id="results">(.*?)</div>', text)
Output
[]
The output above shows that the scraper failed to extract any information: the results div is present in the page source, but it is empty, because its contents are filled in by JavaScript after the page loads.
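To confirm this, we can check that the element itself exists in the raw markup even though it has no contents. A minimal sketch, assuming the container is <div id="results"> (the same id the regex above and the CSS selector later in this chapter use):

import urllib.request

# Fetch the static HTML; the results container is present in the markup,
# but its rows are injected by JavaScript, so a plain HTTP fetch sees it empty.
html = urllib.request.urlopen(
    'http://example.webscraping.com/places/default/search').read().decode()
print('<div id="results"' in html)   # expected: True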
How to Scrape Data from Dynamic Websites
We’ve seen that scrapers can’t scrape information from dynamic websites because the data is loaded dynamically using JavaScript. In this case, we can use the following two techniques to scrape data from dynamic websites that rely on JavaScript:
- Reverse Engineering JavaScript
- Rendering JavaScript
Reverse Engineering JavaScript
A process known as reverse engineering can be useful in understanding how data is dynamically loaded by a web page.
To do this, we open the browser’s Inspect Element tool for the URL in question. Next, we click the NETWORK tab and find the requests made for the page, including search.json with a path of /ajax. Instead of accessing this AJAX data from the browser via the NETWORK tab, we can also fetch it with the following Python script:
import requests

# Call the AJAX endpoint directly and parse the JSON response.
response = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')
response.json()
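Before writing the full crawler, it can help to inspect the shape of the JSON payload. A small sketch (the records and num_pages keys are the ones the full script below relies on), which also shows that parsing the raw text with json.loads is equivalent to calling response.json():

import json
import requests

response = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')

# Parsing the raw text ourselves gives the same result as response.json().
data = json.loads(response.text)
print(data['num_pages'])     # how many pages this search spans
print(data['records'][0])    # the first matching record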
Example
The above script allows us to access the JSON response using the response’s json() method. Similarly, we can download the raw string response and load it ourselves with Python’s json.loads method. Building on this, the following Python script scrapes all of the countries by searching for each letter of the alphabet in turn and then iterating over the resulting pages of JSON responses.
import requests
import string

PAGE_SIZE = 15
url = 'http://example.webscraping.com/ajax/' + 'search.json?page={}&page_size={}&search_term={}'

countries = set()
for letter in string.ascii_lowercase:
    print('Searching with %s' % letter)
    page = 0
    while True:
        response = requests.get(url.format(page, PAGE_SIZE, letter))
        data = response.json()
        print('adding %d records from the page %d' % (len(data.get('records')), page))
        # Collect the country name from every record on this page.
        for record in data.get('records'):
            countries.add(record['country'])
        page += 1
        if page >= data['num_pages']:
            break

# Persist the sorted, de-duplicated country names to disk.
with open('countries.txt', 'w') as countries_file:
    countries_file.write('\n'.join(sorted(countries)))
After running the above script, we will get the following output. The records will be saved in a file called countries.txt.
Output
Searching with a
adding 15 records from the page 0
adding 15 records from the page 1
...
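As a quick sanity check (a hypothetical follow-up step, not part of the original script), the saved file can be read back to count the countries collected:

# Read the saved country list back from countries.txt.
with open('countries.txt') as countries_file:
    countries = countries_file.read().splitlines()
print('%d unique countries saved' % len(countries))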
Rendering JavaScript
In the previous section, we reverse engineered a webpage and learned how the API works and how we can use it to retrieve the results of a single request. However, when reverse engineering, we may face the following difficulties.
- Sometimes websites can be very difficult to understand. For example, if a website is built with an advanced browser tool such as Google Web Toolkit (GWT), the resulting JavaScript code is machine-generated and difficult to understand and reverse engineer.
- Some more advanced frameworks, such as React.js, make reverse engineering difficult by abstracting already complex JavaScript logic.
A solution to this difficulty is to use the browser rendering engine to parse HTML, apply CSS formatting, and execute JavaScript to display the web page.
Example
In this example, to render JavaScript we will use the familiar Python module Selenium. The following Python code renders a webpage with Selenium’s help −
First, we need to import webdriver from Selenium as shown below
from selenium import webdriver
Now, as per our requirement, provide the path to the web driver we have downloaded
path = r'C:\Users\gaurav\Desktop\Chromedriver'
driver = webdriver.Chrome(executable_path = path)
Now, provide the URL we want to open in the web browser, which is now controlled by our Python script.
driver.get('http://example.webscraping.com/search')
Now, we can use the ID of the search box to select the element and enter a search term.
driver.find_element_by_id('search_term').send_keys('.')
Next, we can use JavaScript to set the contents of the select box as shown below −
js = "document.getElementById('page_size').options[1].text = '100';"
driver.execute_script(js)
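Note that this snippet only rewrites the visible label of the second option in the page_size select box; depending on how the site reads the submitted form, it may also be necessary to change the option’s value attribute, or to select the option outright, for the larger page size to take effect.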
The following line of code clicks the search button on the web page −
driver.find_element_by_id('search').click()
The next line of code sets an implicit wait of 45 seconds, so that subsequent element lookups give the AJAX request time to complete.
driver.implicitly_wait(45)
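As an alternative to a blanket implicit wait, an explicit wait on the results container is often more precise. A small sketch using Selenium’s WebDriverWait (an addition to the original steps; it reuses the #results id from the selector below):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until at least one result link is present, or 45 seconds pass.
WebDriverWait(driver, 45).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#results a')))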
Now, to select the country links, we can use CSS selectors as shown below
links = driver.find_elements_by_css_selector('#results a')
Now, the text of each link can be extracted to create a list of countries −
countries = [link.text for link in links]
print(countries)
driver.close()
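Putting the steps together, here is a consolidated sketch of the whole rendering approach. Note that the find_element_by_* helpers used above come from Selenium 3; Selenium 4 replaces them with find_element(By..., ...), which this sketch uses (Selenium 4.6+ can also locate chromedriver on its own, so no explicit path is needed):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()   # Selenium 4.6+ resolves the driver binary itself
driver.get('http://example.webscraping.com/search')

# Fill the search box and relabel the page-size option via JavaScript.
driver.find_element(By.ID, 'search_term').send_keys('.')
driver.execute_script("document.getElementById('page_size').options[1].text = '100';")

# Submit the search and allow the AJAX request time to complete.
driver.find_element(By.ID, 'search').click()
driver.implicitly_wait(45)

# Extract the country names from the rendered results.
links = driver.find_elements(By.CSS_SELECTOR, '#results a')
countries = [link.text for link in links]
print(countries)

driver.close()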