Using an HTTP cache for development
The development of a web crawler is a process of exploration, one that iterates through various refinements to retrieve the requested information. During development, you will often be hitting remote servers, and the same URLs on those servers, over and over. This is not polite. Fortunately, Scrapy also comes to the rescue here by providing caching middleware that is specifically designed to help in this situation.
How to do it
Scrapy will cache requests using a middleware module named HttpCacheMiddleware. Enabling it is as simple as configuring the HTTPCACHE_ENABLED setting to True:
from scrapy.crawler import CrawlerProcess

# Enable the HTTP cache for this crawl
process = CrawlerProcess({
    'HTTPCACHE_ENABLED': True
})
process.crawl(Spider)  # Spider is the spider class defined earlier in the recipe
process.start()
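The first run of the spider downloads pages as usual and writes each response to disk; subsequent runs are answered from the cache, so they complete quickly and without touching the remote server. With the default filesystem storage, the cached responses land beneath a .scrapy/httpcache directory.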
How it works
The implementation of HTTP caching is simple in concept, yet its details can grow complex. The HttpCacheMiddleware provided by Scrapy has a plethora of configuration options to match your needs. Ultimately, it comes down to storing each URL and its response content locally, so that repeated requests for the same URL can be served from disk rather than from the remote server.
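As a sketch of the main knobs, the configuration below spells out several commonly used cache settings with their default values. FilesystemCacheStorage and DummyPolicy are Scrapy's stock storage and policy classes (DummyPolicy caches every response unconditionally, while RFC2616Policy honors Cache-Control headers like a real HTTP cache); Spider again stands in for whatever spider class the recipe defines.

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'HTTPCACHE_ENABLED': True,
    'HTTPCACHE_EXPIRATION_SECS': 0,     # 0 = cached responses never expire
    'HTTPCACHE_DIR': 'httpcache',       # resolved beneath the .scrapy data directory
    'HTTPCACHE_IGNORE_HTTP_CODES': [],  # e.g. [500, 503] to avoid caching server errors
    'HTTPCACHE_STORAGE': 'scrapy.extensions.httpcache.FilesystemCacheStorage',
    'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.DummyPolicy',
})
process.crawl(Spider)
process.start()

For development, the defaults shown here are usually what you want: every response is cached forever, and you simply delete the cache directory when you need fresh data.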