Using auto throttling
Closely tied to controlling the maximum level of concurrency is the concept of throttling. Websites vary in their ability to handle requests, both from one site to another and on a single site at different times. When response times slow down, it makes sense to reduce the number of requests being sent. This can be a tedious process to monitor and adjust by hand.
Fortunately for us, Scrapy also provides a way to do this via an extension named AutoThrottle.
How to do it
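AutoThrottle is disabled by default, so it is switched on with AUTOTHROTTLE_ENABLED and then tuned using the AUTOTHROTTLE_TARGET_CONCURRENCY setting:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'AUTOTHROTTLE_ENABLED': True,  # the extension is off unless enabled
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 3
})
process.crawl(Spider)
process.start()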
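AutoThrottle has a handful of companion settings that are often set alongside the target concurrency. The following is a minimal sketch of such a settings dictionary; the name AUTOTHROTTLE_SETTINGS and the specific values are illustrative assumptions, not recommendations from this recipe. The dictionary is meant to be passed to CrawlerProcess in place of the inline one shown above.

```python
# Illustrative settings only: the variable name and the values are assumptions.
AUTOTHROTTLE_SETTINGS = {
    # AutoThrottle is off by default and must be enabled explicitly.
    'AUTOTHROTTLE_ENABLED': True,
    # Average number of parallel requests to aim for per remote site.
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 3.0,
    # Initial download delay, in seconds, before any latency is measured.
    'AUTOTHROTTLE_START_DELAY': 5.0,
    # Upper bound, in seconds, on the delay AutoThrottle may set.
    'AUTOTHROTTLE_MAX_DELAY': 60.0,
    # Log throttling stats for every response while tuning.
    'AUTOTHROTTLE_DEBUG': True,
    # Acts as the lower bound on the delay AutoThrottle may set.
    'DOWNLOAD_DELAY': 0.5,
}

# Usage (requires Scrapy and a spider class such as the recipe's Spider):
#   from scrapy.crawler import CrawlerProcess
#   process = CrawlerProcess(AUTOTHROTTLE_SETTINGS)
#   process.crawl(Spider)
#   process.start()
```

Turning on AUTOTHROTTLE_DEBUG is a convenient way to watch the delay change as the site's response times vary.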
How it works
Scrapy tracks the latency of each request. Using that information, it adjusts the delay between requests to a specific domain so that, on average, about AUTOTHROTTLE_TARGET_CONCURRENCY requests are simultaneously active for that domain (this is a target the crawler approaches rather than a hard limit), and that the requests...
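The delay adjustment described above can be sketched as a small standalone function. This is a simplified illustration based on the algorithm described in Scrapy's AutoThrottle documentation, not the extension's actual code; the function name adjust_delay and its parameters are hypothetical. Here min_delay stands in for DOWNLOAD_DELAY and max_delay for AUTOTHROTTLE_MAX_DELAY.

```python
def adjust_delay(current_delay, latency, status,
                 target_concurrency, min_delay, max_delay):
    """Return the next download delay for a domain (hypothetical helper)."""
    # Delay at which about `target_concurrency` requests would be in flight.
    target_delay = latency / target_concurrency

    # Move halfway from the current delay toward the target delay.
    new_delay = (current_delay + target_delay) / 2.0

    # If the site is slower than the current delay implies, jump straight
    # to the target delay instead of approaching it gradually.
    new_delay = max(target_delay, new_delay)

    # Keep the delay within the configured bounds.
    new_delay = min(max(min_delay, new_delay), max_delay)

    # Error responses often come back quickly; they must not be allowed
    # to shorten the delay.
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


# A fast 200 response: the delay moves halfway from 5.0 toward 0.75 / 3.
print(adjust_delay(5.0, 0.75, 200, 3.0, 0.0, 60.0))  # 2.625
# A fast error response: the delay is left unchanged.
print(adjust_delay(5.0, 0.75, 503, 3.0, 0.0, 60.0))  # 5.0
```

Averaging with the previous delay smooths out one-off latency spikes, while the jump to the target delay lets the crawler back off quickly when a site really does slow down.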