Controlling the length of a crawl
The length of a crawl, in terms of the number of pages parsed, can be controlled with the CLOSESPIDER_PAGECOUNT setting. This setting is handled by Scrapy's CloseSpider extension, which closes the spider once the given number of responses has been processed.
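As an aside, the same limit can also be applied per spider through the custom_settings class attribute instead of the CrawlerProcess settings dictionary. A minimal sketch (the spider name here is hypothetical, used only to illustrate the attribute):

import scrapy

class LimitedSpider(scrapy.Spider):
    # Hypothetical spider; custom_settings overrides the project-wide
    # configuration for this spider only.
    name = 'limited'
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 5  # stop after roughly 5 pages are parsed
    }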
How to do it
We will be using the script in 06/07_limit_length.py. The script and scraper are the same as the NASA sitemap crawler with the addition of the following configuration to limit the number of pages parsed to 5:
if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'CLOSESPIDER_PAGECOUNT': 5
    })
    process.crawl(Spider)
    process.start()

When this is run, the following output will be generated (interspersed in the logging output):
<200 https://www.nasa.gov/exploration/systems/sls/multimedia/sls-hardware-being-moved-on-kamag-transporter.html>
<200 https://www.nasa.gov/exploration/systems/sls/M17-057.html>
<200 https://www.nasa.gov/press-release/nasa-awards-contract-for-center-protective-services-for-glenn-research-center/>
<200 https://www.nasa.gov/centers...
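The full script lives in 06/07_limit_length.py and is not reproduced here, but a rough, self-contained sketch of the same pattern might look like the following; the spider body and the sitemap URL are illustrative assumptions, not the book's actual scraper:

from scrapy.spiders import SitemapSpider
from scrapy.crawler import CrawlerProcess

class Spider(SitemapSpider):
    # Illustrative stand-in for the NASA sitemap crawler; the sitemap URL
    # below is an assumption for the sake of a runnable example.
    name = 'nasa_sitemap'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

    def parse(self, response):
        # Each response handled here counts toward CLOSESPIDER_PAGECOUNT;
        # once 5 responses have been crawled, CloseSpider shuts the spider down.
        print(response)

if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'CLOSESPIDER_PAGECOUNT': 5
    })
    process.crawl(Spider)
    process.start()

Note that CLOSESPIDER_PAGECOUNT is a soft limit: the shutdown is graceful, so requests already in flight when the threshold is reached may still be downloaded and parsed, and the output can show slightly more than 5 responses.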