Scrapy Crawler Framework Practice: Using CSS Selectors
What is the Scrapy Crawler Framework?
Scrapy is a powerful Python crawler framework that helps developers write crawler programs quickly and efficiently, and it provides many convenient tools and features. The framework consists of several components: the Engine, the Scheduler, the Downloader, Spiders, and Item Pipelines. In Scrapy's architecture, the Spider is the core component, responsible for parsing web pages and extracting the required information.
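To give a feel for the Pipeline component mentioned above, here is a minimal item pipeline sketch. It is a hypothetical example, not part of the spiders in this article: the class name and output file are placeholders.

# pipelines.py -- a minimal item pipeline sketch (hypothetical example)
class SaveToFilePipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open('items.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.file.write(str(item) + '\n')
        return item

A pipeline is activated by registering it in settings.py, for example ITEM_PIPELINES = {'myproject.pipelines.SaveToFilePipeline': 300}, where 'myproject' is a placeholder project name.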
Introduction to CSS Selectors
In the Scrapy framework, we typically use CSS selectors to locate and extract data from web pages. CSS selectors are a syntax for selecting HTML elements by tag name, class name, ID, attribute, and other methods. In Scrapy, we can use CSS selectors to extract data from web pages and store it in a database or file.
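To make the syntax concrete, here is a small self-contained sketch that runs Scrapy's Selector on an inline HTML snippet (the HTML itself is invented for illustration):

from scrapy.selector import Selector

html = '''
<div id="content">
  <h2 class="news-title">Breaking news</h2>
  <a class="news-link" href="/article/1">Read more</a>
</div>
'''

sel = Selector(text=html)

# Select by tag name, class, ID, and attribute, then extract values.
print(sel.css('h2::text').extract_first())                 # Breaking news
print(sel.css('.news-title::text').extract_first())        # Breaking news
print(sel.css('#content h2::text').extract_first())        # Breaking news
print(sel.css('a.news-link::attr(href)').extract_first())  # /article/1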
Extracting Data Using CSS Selectors
Let’s say we want to extract news headlines and links from the example.com website. We can use the Scrapy framework and CSS selectors to achieve this. First, we create a spider class named NewsSpider and define the website we want to crawl and the extraction rules.
import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Extract all headline texts and all link hrefs as parallel lists.
        news_titles = response.css('h2.news-title::text').extract()
        news_links = response.css('a.news-link::attr(href)').extract()
        # Pair each title with its link and emit one item per pair.
        for title, link in zip(news_titles, news_links):
            yield {
                'title': title,
                'link': link
            }
In the above code, we first define a spider class named NewsSpider and point it at the website address http://example.com. In the parse method, response.css is called with CSS selectors to extract the news titles and links, and each title/link pair is emitted as a dictionary using the yield keyword.
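One caveat about the zip approach above: if an entry on the page is missing its link (or its title), the two lists fall out of alignment and titles get paired with the wrong links. A more robust pattern is to select a per-entry container first and extract fields relative to it. The sketch below assumes a hypothetical container class div.news-item; the Douban example later in this article uses this same pattern.

def parse(self, response):
    # 'div.news-item' is an assumed container class for illustration.
    for item in response.css('div.news-item'):
        yield {
            'title': item.css('h2.news-title::text').extract_first(),
            'link': item.css('a.news-link::attr(href)').extract_first(),
        }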
Running the Crawler
To run our crawler, execute the following command in the command line:
scrapy runspider NewsSpider.py -o news.csv
The above command runs the spider defined in NewsSpider.py and saves the extracted data to the news.csv file.
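Scrapy infers the export format from the output file's extension, so the same spider can write other formats without any code changes (and since Scrapy 2.0, -O overwrites the output file instead of appending to it):

scrapy runspider NewsSpider.py -o news.json
scrapy runspider NewsSpider.py -o news.jl

The first command produces a JSON array; the second produces JSON Lines, with one item per line.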
Real-World Example: Extracting Douban Movie Ranking Data
Now let’s look at a real-world example. We’ll use the Scrapy framework and CSS selectors to extract movie titles and ratings from the Douban Movie Rankings.
First, create a spider class named DoubanMovieSpider and define the website we want to crawl and the extraction rules.
import scrapy

class DoubanMovieSpider(scrapy.Spider):
    name = 'douban_movie'
    start_urls = ['https://movie.douban.com/top250']
    # Douban often rejects Scrapy's default User-Agent, so a
    # browser-like one is set here; adjust it if requests still fail.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }

    def parse(self, response):
        # Each movie on the page is wrapped in a div.item container.
        movies = response.css('div.item')
        for movie in movies:
            title = movie.css('span.title::text').extract_first()
            rating = movie.css('span.rating_num::text').extract_first()
            yield {
                'title': title,
                'rating': rating
            }
In the above code, we define a spider class named DoubanMovieSpider and specify the Douban movie ranking URL https://movie.douban.com/top250 to crawl. The parse method uses CSS selectors to extract each movie's title and rating, then yields the data as a dictionary.
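The Top 250 list spans multiple pages. A natural extension is to follow the "next page" link at the end of parse; the sketch below assumes that link sits inside a span.next element (verify this against the live page markup):

def parse(self, response):
    for movie in response.css('div.item'):
        yield {
            'title': movie.css('span.title::text').extract_first(),
            'rating': movie.css('span.rating_num::text').extract_first(),
        }
    # Follow the "next page" link, if any, and parse it the same way.
    # 'span.next a' is an assumption about Douban's markup.
    next_page = response.css('span.next a::attr(href)').extract_first()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)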
To run this crawler, we again execute the following command in the command line:
scrapy runspider DoubanMovieSpider.py -o douban_movies.csv
The above command runs DoubanMovieSpider.py and saves the extracted data to the douban_movies.csv file.
Summary
Through this article, we’ve learned the basics of the Scrapy crawler framework and of CSS selectors. CSS selectors are a powerful tool for locating elements and play a crucial role in crawling. By using them effectively, we can quickly and accurately extract information from web pages and build working crawlers.