Sequential crawler
We can now use AlexaCallback with a slightly modified version of the link crawler we developed earlier to download the top 500 Alexa URLs sequentially. First, we update the link crawler so it accepts either a single start URL or a list of start URLs:
    # In link_crawler function
    if isinstance(start_url, list):
        crawl_queue = start_url
    else:
        crawl_queue = [start_url]
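In context, the top of the updated function might look like the following sketch (the parameters other than start_url are illustrative; the full implementation lives in the repository linked below):

    def link_crawler(start_url, link_regex, max_depth=4):
        """Crawl from one or more seed URLs, following links that
        match link_regex. A simplified sketch of the full function."""
        # Accept either a single URL string or a list of seed URLs
        if isinstance(start_url, list):
            crawl_queue = start_url
        else:
            crawl_queue = [start_url]
        while crawl_queue:
            url = crawl_queue.pop()
            ...  # download url, extract links, extend crawl_queue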
We also need to update the way robots.txt is handled for each site: we use a simple dictionary to store the parsers per domain (see https://github.com/kjam/wswp/blob/master/code/chp4/advanced_link_crawler.py#L53-L72).
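A minimal sketch of that idea, using the standard library's urllib.robotparser, follows; the helper name get_robots_parser and the module-level cache are simplifications of the repository code:

    from urllib import robotparser
    from urllib.parse import urlparse

    # One RobotFileParser per domain, so each site's robots.txt
    # is downloaded and parsed at most once during the crawl.
    robots_parsers = {}

    def get_robots_parser(url):
        """Return a cached robots.txt parser for the domain of url."""
        parts = urlparse(url)
        domain = '{}://{}'.format(parts.scheme, parts.netloc)
        if domain not in robots_parsers:
            rp = robotparser.RobotFileParser()
            rp.set_url(domain + '/robots.txt')
            rp.read()  # fetch and parse the robots.txt file
            robots_parsers[domain] = rp
        return robots_parsers[domain]

The crawler can then call get_robots_parser(url).can_fetch(user_agent, url) before downloading each page, instead of re-fetching robots.txt for every request.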
We also need to handle the fact that not every URL we encounter will be relative, and some of them aren't even URLs we can visit, such as e-mail addresses with mailto: or javascript: event commands.
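One way to filter such links before queueing them is sketched below (the helper name is hypothetical; urljoin resolves relative paths against the page they were found on):

    from urllib.parse import urljoin, urlparse

    def normalize_link(page_url, link):
        """Return an absolute, crawlable URL, or None if the link
        should be skipped (e.g. mailto: or javascript: links)."""
        if link.startswith(('mailto:', 'javascript:')):
            return None
        abs_link = urljoin(page_url, link)  # resolve relative URLs
        # Only http(s) URLs are worth queueing for the crawler
        if urlparse(abs_link).scheme not in ('http', 'https'):
            return None
        return abs_link

Links that come back as None are simply dropped rather than added to the crawl queue.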
Additionally, because some sites are missing robots.txt files and other URLs are poorly formed, a few extra error-handling sections have been added, along with a new no_robots variable, which allows us to continue crawling a site even when its robots.txt file cannot be found.
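A sketch of how the no_robots fallback might wrap that lookup, building on the get_robots_parser helper above (the real error handling in the repository is more granular):

    def can_crawl(url, user_agent='wswp'):
        """Check robots.txt permission, tolerating sites without one."""
        no_robots = False
        try:
            rp = get_robots_parser(url)
        except Exception:
            # The site has no robots.txt, or it could not be fetched
            # or parsed; record that and keep crawling regardless.
            no_robots = True
        if no_robots:
            return True
        return rp.can_fetch(user_agent, url)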