Preventing bans by scraping via proxies
A site that you are scraping may block you once it identifies you as a scraper. Often this happens because the webmaster sees the scrape requests all coming from a single IP address and simply blocks access from that IP.
To help prevent this problem, it is possible to use proxy randomization middleware within Scrapy. There exists a library, scrapy-proxies, which implements a proxy randomization feature.
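To make the idea of proxy randomization concrete, here is a minimal sketch of what such a middleware does for each request. It is not the scrapy-proxies implementation: the class names RandomProxySketch and FakeRequest and the proxy addresses are hypothetical, and the only Scrapy convention relied on is that HttpProxyMiddleware reads the proxy to use from request.meta["proxy"].

```python
import random


class RandomProxySketch:
    """Hypothetical middleware-style class (not the scrapy-proxies code)."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self.proxies = list(proxies)

    def process_request(self, request, spider=None):
        # Scrapy's HttpProxyMiddleware reads the proxy from request.meta["proxy"],
        # so assigning a random entry here routes this request through it.
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # None lets Scrapy continue handling the request normally


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy installed."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


# Placeholder proxy addresses from a documentation-only IP range (assumed values).
middleware = RandomProxySketch([
    "http://203.0.113.10:8080",
    "http://203.0.113.11:3128",
])
request = FakeRequest("https://example.com")
middleware.process_request(request)
print(request.meta["proxy"])
```

Because each request picks its proxy independently, consecutive requests tend to reach the target site from different IP addresses, which is what makes a block on any single IP less effective.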
Getting ready
You can get scrapy-proxies from GitHub at https://github.com/aivarsk/scrapy-proxies or by installing it using pip install scrapy_proxies.
How to do it
scrapy-proxies is used through configuration. Start by configuring DOWNLOADER_MIDDLEWARES, making sure it has RetryMiddleware, RandomProxy, and HttpProxyMiddleware installed. The following would be a typical configuration:
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons...
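The listing above is cut off, so the following is only a sketch of how a complete settings.py fragment might look, based on the three middlewares named in the text. Everything beyond RETRY_TIMES is an assumption here: the list of retry codes, the priority numbers, the proxy list path, and the PROXY_MODE value are illustrative and should be checked against the scrapy-proxies README before use.

```python
# settings.py fragment (illustrative sketch; values below are assumptions)

# Retry many times since proxies often fail
RETRY_TIMES = 10

# Assumed set of HTTP codes to retry on; adjust to your target site.
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Lower numbers run earlier on outgoing requests, so RandomProxy picks a
# proxy before HttpProxyMiddleware applies it. The exact priority numbers
# are assumed; only the relative order matters for this sketch.
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

# Hypothetical path to a text file of proxies, one per line,
# for example http://host:port
PROXY_LIST = "/path/to/proxy/list.txt"

# Assumed to select a different random proxy for every request.
PROXY_MODE = 0
```

Placing RandomProxy between the retry and HTTP proxy middlewares means that a retried request passes through the proxy selection step again rather than being resent unchanged.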