Caching responses
Scrapy comes with the ability to cache HTTP requests and responses. Caching can greatly reduce crawling time when pages have already been visited: with the cache enabled, Scrapy stores every request and response on disk and can serve repeat requests from that store instead of hitting the site again.
How to do it
There is a working example in the 06/10_file_cache.py script. In Scrapy, the caching middleware is disabled by default. To enable it, set HTTPCACHE_ENABLED to True and HTTPCACHE_DIR to a directory on the file system (a relative path will create the directory under the project's data folder). To demonstrate, the script runs a crawl of the NASA site and caches the content. It is configured using the following:
from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'CRITICAL',        # only report critical errors
        'CLOSESPIDER_PAGECOUNT': 50,    # stop the crawl after 50 pages
        'HTTPCACHE_ENABLED': True,      # turn the cache middleware on
        'HTTPCACHE_DIR': "."            # cache directory, relative to the project data folder
    })
    process.crawl(Spider)
    process.start()
We ask Scrapy to cache using files and to create a sub-directory in the current folder. We also instruct it to stop the crawl after 50 pages (CLOSESPIDER_PAGECOUNT) and to log only critical messages, which keeps the output quiet while the crawl runs.
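The snippet above passes a Spider class, defined earlier in the script, to process.crawl(). That class is not reproduced here, but a minimal sketch of a crawl spider along the same lines might look as follows; the spider name, start URL, and link-extraction rule are illustrative assumptions, not the script's exact contents:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Spider(CrawlSpider):
    # Hypothetical stand-in for the spider defined in 06/10_file_cache.py
    name = 'nasa'
    allowed_domains = ['nasa.gov']          # assumption: keep the crawl on nasa.gov
    start_urls = ['https://www.nasa.gov/']  # assumption: crawl starts at the home page

    # Follow every link found and hand each fetched page to parse_item
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # Just report the URL; a real recipe would extract data here
        print("Parsed:", response.url)

After the first run completes, re-running the script should finish noticeably faster, because responses are read back from the on-disk cache rather than refetched. Note that cached entries never expire by default (HTTPCACHE_EXPIRATION_SECS defaults to 0); setting it to a positive number of seconds makes Scrapy refetch any page whose cached copy is older than that age.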