Retrying failed page downloads
Failed page requests are easily handled by Scrapy's retry middleware. When enabled, Scrapy will retry a request when it receives any of the following HTTP error codes:
[500, 502, 503, 504, 408]
The process can be further configured using the following parameters:
RETRY_ENABLED (True/False - default is True)
RETRY_TIMES (the number of times to retry on any error - default is 2)
RETRY_HTTP_CODES (a list of HTTP error codes which should be retried - default is [500, 502, 503, 504, 408])
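These settings can also be scoped to a single spider through its custom_settings attribute. The following is a minimal sketch; the spider name and values are illustrative, not taken from the recipe's script:

import scrapy

class RetryTunedSpider(scrapy.Spider):
    # Hypothetical spider for illustration only
    name = 'retry_tuned'
    custom_settings = {
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 5,               # retry up to five times
        'RETRY_HTTP_CODES': [503, 504]  # only retry on these codes
    }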
How to do it
The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'LOG_LEVEL': 'DEBUG',
    'DOWNLOADER_MIDDLEWARES': {
        # 500 is the default priority for RetryMiddleware
        "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500
    },
    'RETRY_ENABLED': True,
    'RETRY_TIMES': 3
})
process.crawl(Spider)
process.start()
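The snippet assumes a Spider class defined earlier in the script. A minimal stand-in, using a placeholder URL rather than the one in the recipe, could look like this:

import scrapy

class Spider(scrapy.Spider):
    name = 'retry_demo'  # hypothetical name
    start_urls = ['http://example.com/']  # placeholder URL

    def parse(self, response):
        # Responses with codes in RETRY_HTTP_CODES never reach here; they
        # are rescheduled by RetryMiddleware up to RETRY_TIMES times
        self.log('Fetched %s with status %d' % (response.url, response.status))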
How it works
Scrapy will pick up the configuration for retries as specified. On receiving any of the codes listed in RETRY_HTTP_CODES, the retry middleware reschedules the request, repeating until it either succeeds or the RETRY_TIMES retry limit is exhausted.
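Retry behavior can also be tuned per request rather than globally. As an illustrative sketch (not part of the recipe's script), RetryMiddleware honors the max_retry_times and dont_retry keys in a request's meta dictionary:

import scrapy

class PerRequestRetrySpider(scrapy.Spider):
    # Hypothetical spider for illustration only
    name = 'per_request_retry'

    def start_requests(self):
        # Allow this request up to 5 retries, overriding RETRY_TIMES
        yield scrapy.Request('http://example.com/flaky',
                             meta={'max_retry_times': 5})
        # Never retry this request, even on an error code
        yield scrapy.Request('http://example.com/once',
                             meta={'dont_retry': True})

    def parse(self, response):
        pass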