Scrapy Crawler Framework Practice: Using CSS Selectors

What is the Scrapy Crawler Framework?

Scrapy is a powerful Python crawler framework that helps developers write crawler programs quickly and efficiently, and it provides many convenient tools and features. The Scrapy framework comprises several modules: the Engine, Scheduler, Downloader, Spider, and Pipeline. In Scrapy's architecture, the Spider is the core component, responsible for parsing web pages and extracting the required information.

Introduction to CSS Selectors

In the Scrapy framework, we typically use CSS selectors to locate and extract data from web pages. CSS selectors are a syntax for selecting HTML elements by tag name, class name, ID, attribute, and other criteria. In Scrapy, we can use CSS selectors to extract data from web pages and store it in a database or file.


Extracting Data Using CSS Selectors

Let’s say we want to extract news headlines and links from the example.com website. We can use the Scrapy framework and CSS selectors to achieve this. First, we need to create a spider class named NewsSpider and define the website we want to crawl and the extraction rules.

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['http://example.com']

    def parse(self, response):
        news_titles = response.css('h2.news-title::text').extract()
        news_links = response.css('a.news-link::attr(href)').extract()

        for title, link in zip(news_titles, news_links):
            yield {
                'title': title,
                'link': link
            }

In the code above, we define a spider class named NewsSpider and specify http://example.com as the site to crawl. In the parse method, the response.css CSS selectors extract the news titles and links, and each title/link pair is yielded as a dictionary.

Running the Crawler

To run our crawler, execute the following command in the command line:

scrapy runspider NewsSpider.py -o news.csv 

The command above runs the spider defined in NewsSpider.py and saves the extracted data to the news.csv file.

Real-World Example: Extracting Douban Movie Ranking Data

Now let’s look at a real-world example. We’ll use the Scrapy framework and CSS selectors to extract movie titles and ratings from the Douban Movie Rankings.

First, create a crawler class named DoubanMovieSpider and define the website we want to crawl and the extraction rules.

import scrapy

class DoubanMovieSpider(scrapy.Spider):
    name = 'douban_movie'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movies = response.css('div.item')

        for movie in movies:
            title = movie.css('span.title::text').extract_first()
            rating = movie.css('span.rating_num::text').extract_first()

            yield {
                'title': title,
                'rating': rating
            }

In the code above, we define a spider class named DoubanMovieSpider and specify the Douban movie ranking URL https://movie.douban.com/top250 to crawl. The parse method selects each div.item container, extracts the movie's title and rating with CSS selectors, and yields them as a dictionary.

As before, we run the crawler from the command line:

scrapy runspider DoubanMovieSpider.py -o douban_movies.csv

The above command will execute our crawler program, DoubanMovieSpider.py, and save the extracted data to the douban_movies.csv file.

Summary

Through this article, we’ve learned about the Scrapy crawler framework and the basic usage of CSS selectors. CSS selectors are a powerful tool for locating elements and play a crucial role in crawlers. By effectively using CSS selectors, we can quickly and accurately extract information from web pages and implement the functionality of our crawler.
