Link extractor with Scrapy
As their name indicates, link extractors are objects that extract links from a Scrapy response object. Scrapy provides a built-in link extractor, the LinkExtractor class in the scrapy.linkextractors module.
How to do it...
Let's build a simple link extractor with Scrapy:
- As in the previous recipe, create another spider, this time to collect all the links. In the new spider file, import the required modules:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
- Create a new spider class and initialize the variables:
class HomeSpider2(CrawlSpider):
    name = 'home2'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
- Now we have to initialize the rule for crawling the URL:
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse_page"
        )
    ]
This rule orders the extraction of...