Web Scraping with Python and Handling CAPTCHAs
In this chapter, let’s learn how to scrape the web and handle CAPTCHAs, which are used to test whether a user is human or a robot.
What is a CAPTCHA?
CAPTCHA stands for Completely Automated Public Turing Test to tell Computers and Humans Apart, which clearly indicates that it is a test to determine whether a user is human.
A CAPTCHA is typically a distorted image of text that is hard for computer programs to read but that humans can still interpret. Many websites use CAPTCHAs to prevent bots from interacting with their forms.
Loading CAPTCHAs with Python
Suppose we want to register on a website and there is a form with a CAPTCHA. Before loading the CAPTCHA image, we need to know which fields the form requires. The following Python script lists the fields of the registration form on the website http://example.webscraping.com.
import lxml.html  # cssselect() also requires the cssselect package
import urllib.request as urllib2
import pprint
import http.cookiejar as cookielib


def form_parsing(html):
    """Return a dict mapping each named <input> in a form to its value."""
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data


REGISTER_URL = 'http://example.webscraping.com/places/default/user/register?_next=/places/default/index'

# Keep cookies between requests, as a browser would
ckj = cookielib.CookieJar()
browser = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckj))
html = browser.open(REGISTER_URL).read()
form = form_parsing(html)
pprint.pprint(form)
In the above Python script, we first define a function that parses the form using the lxml Python module and then prints the form requirements, as shown below.
{
'_formkey': '5e306d73-5774-4146-a94e-3541f22c95ab',
'_formname': 'register',
'_next': '/places/default/index',
'email': '',
'first_name': '',
'last_name': '',
'password': '',
'password_two': '',
'recaptcha_response_field': None
}
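To see how this kind of form parsing behaves without making a network request, here is a minimal sketch that uses only the standard library's html.parser in place of lxml. The class InputCollector, the function parse_form_stdlib() and the SAMPLE_HTML snippet are hypothetical names and data made up for illustration; they are not part of the original script or of the target website.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs of <input> elements found inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self.in_form = True
        elif tag == 'input' and self.in_form:
            attrs = dict(attrs)
            # Skip inputs without a name, as the lxml version does
            if attrs.get('name'):
                self.fields[attrs['name']] = attrs.get('value')

    def handle_endtag(self, tag):
        if tag == 'form':
            self.in_form = False


# Made-up form for illustration only
SAMPLE_HTML = '''
<form>
  <input name="first_name" value="">
  <input name="_formname" value="register">
  <input name="recaptcha_response_field">
  <input type="submit">
</form>
'''


def parse_form_stdlib(html):
    parser = InputCollector()
    parser.feed(html)
    return parser.fields


fields = parse_form_stdlib(SAMPLE_HTML)
print(fields)
```

An input with an empty value attribute maps to an empty string, while an input with no value attribute at all maps to None, which is why recaptcha_response_field appears as None in the output.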
In the form output, every field except recaptcha_response_field is self-explanatory. Now the question becomes, how do we deal with this field and download the CAPTCHA image? This can be accomplished with the help of the Pillow Python library, as shown below.
Pillow Python Package
Pillow is a fork of the Python Imaging Library that has useful functions for manipulating images. It can be installed with the following command:
pip install pillow
In the next example, we will use it to load the captcha:
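import base64
from io import BytesIO

import lxml.html
from PIL import Image


def load_captcha(html):
    """Return the CAPTCHA embedded in the page as a PIL Image."""
    tree = lxml.html.fromstring(html)
    # The src attribute holds a data URI; the base64 payload follows the comma
    img_data = tree.cssselect('div#recaptcha img')[0].get('src')
    img_data = img_data.partition(',')[-1]
    # str.decode('base64') only exists in Python 2; use the base64 module
    binary_img_data = base64.b64decode(img_data)
    file_like = BytesIO(binary_img_data)
    img = Image.open(file_like)
    return img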
The above Python script uses the Pillow package and defines a function to load a CAPTCHA image. It is meant to be used together with the form_parsing() function defined in the previous script, which retrieves the registration form information. The function returns the CAPTCHA as a PIL Image object, a format from which the text can then be extracted as a string.
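The decoding step relies on the image being embedded in the page as a base64 data URI. The following sketch isolates that step using only the standard library. The helper decode_data_uri() and the sample bytes are hypothetical and serve only as illustration; the sketch assumes the src attribute has the form "data:<type>;base64,<payload>".

```python
import base64


def decode_data_uri(src):
    """Return the raw bytes carried by a base64 data URI."""
    header, _, payload = src.partition(',')
    # Reject anything that is not a base64 data URI, e.g. a normal image URL
    if not header.startswith('data:') or not header.endswith(';base64'):
        raise ValueError('not a base64 data URI')
    return base64.b64decode(payload)


# Build a sample data URI from known bytes (the 8-byte PNG signature)
sample_bytes = b'\x89PNG\r\n\x1a\n'
sample_src = 'data:image/png;base64,' + base64.b64encode(sample_bytes).decode('ascii')

decoded = decode_data_uri(sample_src)
print(decoded == sample_bytes)
```

The resulting bytes are what gets wrapped in BytesIO and handed to Image.open() in the function above.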
OCR: Extracting Text from Images with Python
After loading the CAPTCHA in a useful format, we can extract its text with the help of optical character recognition (OCR), which is the process of extracting text from images. For this purpose, we will use the open-source Tesseract OCR engine. The pytesseract package, a Python wrapper around Tesseract, can be installed with the following command (the Tesseract engine itself has to be installed separately):
pip install pytesseract
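Because pytesseract calls the tesseract executable, it can help to check that the program is reachable before running OCR. The sketch below uses only the standard library; the function name tesseract_available() is hypothetical, and it assumes the executable is named tesseract and is expected on the PATH.

```python
import shutil


def tesseract_available(command='tesseract'):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(command) is not None


if tesseract_available():
    print('Tesseract found')
else:
    print('Tesseract not found: install the OCR engine first')
```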
Example
Here we will expand on the above Python script by loading the captcha using the Pillow Python package, as shown below.
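import pytesseract

img = load_captcha(html)
img.save('captcha_original.png')
# Convert to 8-bit grayscale
gray = img.convert('L')
gray.save('captcha_gray.png')
# Threshold: only pure-black pixels (value 0) stay black, all others turn white
bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
bw.save('captcha_thresholded.png')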
The above Python script converts the CAPTCHA to a black-and-white image, which is cleaner and easier to pass to Tesseract, as shown below.
pytesseract.image_to_string(bw)
After running the above script, we will get the text of the CAPTCHA for the registration form as output.
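The thresholding step applied before OCR is easy to overlook, so here is a minimal pure-Python sketch of the same rule on a small grid of grayscale values. The function threshold() and the sample values are hypothetical and for illustration only; Pillow applies the equivalent mapping to every pixel through point().

```python
def threshold(pixels, cutoff=1):
    """Map each 0-255 grayscale value to 0 (black) if below cutoff, else 255 (white)."""
    return [[0 if value < cutoff else 255 for value in row] for row in pixels]


# Made-up 2x3 grayscale image: 0 is black, 255 is white
sample = [
    [0, 12, 255],
    [200, 0, 90],
]

result = threshold(sample)
print(result)
```

With a cutoff of 1, only pixels that are exactly black survive, which suits a CAPTCHA whose text is pure black on a noisy background. For other images a higher cutoff may be needed, and that is a matter of experimentation.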