Controlling the length of a crawl
The length of a crawl, in terms of the number of pages that can be parsed, can be controlled with the CLOSESPIDER_PAGECOUNT setting.
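This setting is provided by Scrapy's CloseSpider extension, which can also end a crawl after a given number of seconds (CLOSESPIDER_TIMEOUT), scraped items (CLOSESPIDER_ITEMCOUNT), or errors (CLOSESPIDER_ERRORCOUNT).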
How to do it
We will be using the script in 06/07_limit_length.py. The script and scraper are the same as the NASA sitemap crawler, with the addition of the following configuration to limit the number of pages parsed to 5:
if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'CLOSESPIDER_PAGECOUNT': 5
    })
    process.crawl(Spider)
    process.start()
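Note that the limit is a soft one: the CloseSpider extension shuts the spider down gracefully, so requests already scheduled or in the downloader when the threshold is reached are still processed, and the crawl may parse slightly more pages than the configured count.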
When this is run, the following output will be generated (interspersed in the logging output):
<200 https://www.nasa.gov/exploration/systems/sls/multimedia/sls-hardware-being-moved-on-kamag-transporter.html>
<200 https://www.nasa.gov/exploration/systems/sls/M17-057.html>
<200 https://www.nasa.gov/press-release/nasa-awards-contract-for-center-protective-services-for-glenn-research-center/>
<200 https://www.nasa.gov/centers...
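An alternative to passing the setting through CrawlerProcess is to declare it on the spider class itself via Scrapy's custom_settings attribute, which keeps the constraint with the spider that needs it. The following is a minimal, self-contained sketch of that approach; the spider name, the parse callback body, and the sitemap URL are illustrative assumptions, not taken from the recipe's script:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import SitemapSpider


class LimitedSpider(SitemapSpider):
    # Hypothetical spider for illustration; the recipe's scraper is
    # the NASA sitemap crawler from earlier in this chapter.
    name = 'limited'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']  # assumed URL

    # Per-spider settings; these override the project-wide settings.
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'CLOSESPIDER_PAGECOUNT': 5
    }

    def parse(self, response):
        # Print each response, mirroring the recipe's output.
        print(response)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(LimitedSpider)
    process.start()

Declaring the limit in custom_settings is convenient when several spiders run under the same process with different limits.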