Retrying failed page downloads
Failed page requests are easily handled by Scrapy using its retry middleware. When this middleware is enabled, Scrapy will attempt retries when receiving the following HTTP error codes:
[500, 502, 503, 504, 408]
The process can be further configured using the following parameters:
RETRY_ENABLED (True/False; defaults to True)
RETRY_TIMES (the number of times to retry on any error; defaults to 2)
RETRY_HTTP_CODES (a list of HTTP error codes that should be retried; defaults to [500, 502, 503, 504, 408])
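These parameters can also be set per spider through the standard custom_settings attribute rather than on the crawler process. The following is a minimal sketch of that approach; the spider name, URL, and the specific values chosen are assumptions for illustration only:

import scrapy

class RetryDemoSpider(scrapy.Spider):
    # Hypothetical spider; name and start URL are illustrative assumptions
    name = 'retry_demo'
    start_urls = ['http://localhost:8080/flaky']

    # Per-spider overrides of the retry parameters listed above
    custom_settings = {
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 5,                # retry up to 5 times instead of the default 2
        'RETRY_HTTP_CODES': [500, 503],  # only retry on these codes
    }

    def parse(self, response):
        yield {'status': response.status, 'url': response.url}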
How to do it
The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'LOG_LEVEL': 'DEBUG',
    'DOWNLOADER_MIDDLEWARES': {
        # declare the retry middleware at its default priority of 500
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500
    },
    'RETRY_ENABLED': True,
    'RETRY_TIMES': 3
})
# Spider is the spider class defined earlier in the script
process.crawl(Spider)
process.start()
How it works
Scrapy will pick up the configuration for retries as specified...
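Retry behavior can also be tuned per request rather than globally: Scrapy's retry middleware honors the dont_retry and max_retry_times request meta keys. The sketch below shows both; the spider name, URLs, and callback are assumptions for illustration:

import scrapy

class PerRequestRetrySpider(scrapy.Spider):
    # Hypothetical spider; name and URLs are illustrative assumptions
    name = 'per_request_retry'

    def start_requests(self):
        # Cap this request at 5 retry attempts, overriding RETRY_TIMES
        yield scrapy.Request('http://localhost:8080/flaky',
                             meta={'max_retry_times': 5},
                             callback=self.parse)
        # Disable retries entirely for this request
        yield scrapy.Request('http://localhost:8080/once',
                             meta={'dont_retry': True},
                             callback=self.parse)

    def parse(self, response):
        yield {'status': response.status, 'url': response.url}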