Adding cache support to the link crawler
To support caching, the download function developed in Chapter 1, Introduction to Web Scraping, needs to be modified to check the cache before downloading a URL. We also need to move throttling inside this function, so that we only throttle when a download is actually made, and not when loading from the cache. To avoid passing the same parameters on every download, we will take this opportunity to refactor the download function into a class, so parameters can be set once in the constructor and reused across many downloads. Here is the updated implementation to support this:
from chp1.throttle import Throttle
from random import choice
import requests


class Downloader:
    def __init__(self, delay=5, user_agent='wswp', proxies=None, cache={}):
        self.throttle = Throttle(delay)
        self.user_agent = user_agent
        self.proxies = proxies
        self.num_retries = None  # we will set this per request
        self.cache = cache

    def __call__(self, url, num_retries...
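The listing above is truncated, but the core idea of `__call__` is: look the URL up in the cache first, and only throttle and fetch on a cache miss. The following is a minimal, self-contained sketch of that pattern; the class and parameter names here (`CachedDownloader`, the injectable `fetch` callable standing in for `requests`, and the inlined throttling) are illustrative assumptions, not the book's exact implementation.

```python
import time


class CachedDownloader:
    """Sketch of cache-check-before-download (hypothetical names)."""

    def __init__(self, delay=1, cache=None, fetch=None):
        self.delay = delay
        # avoid a shared mutable default by creating the dict per instance
        self.cache = cache if cache is not None else {}
        self.last_request = 0.0
        # injectable fetch function keeps this sketch runnable without requests
        self.fetch = fetch or (lambda url: {'html': '<html>stub</html>', 'code': 200})

    def __call__(self, url):
        # check the cache first; throttle and download only on a miss
        try:
            result = self.cache[url]
        except KeyError:
            self._throttle()
            result = self.fetch(url)
            self.cache[url] = result
        return result['html']

    def _throttle(self):
        # sleep only when an actual download is about to happen
        elapsed = time.time() - self.last_request
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request = time.time()
```

Because the throttle lives inside the miss branch, repeated requests for the same URL return instantly from the cache instead of sleeping between them, which is exactly the behavior change the refactoring is meant to achieve.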