Using identifiable user agents
What happens if you violate a site's terms of service and get flagged by its owner? How can you help the site owner contact you, so that they can politely ask you to scale back to what they consider a reasonable level of scraping?
To facilitate this, you can add information about yourself to the User-Agent header of your requests. We have already seen an example of this in robots.txt files, such as the one from amazon.com, which explicitly names the user agent for Google's crawler: Googlebot.
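As a rough illustration (a simplified sketch, not the actual contents of any particular site's file), a robots.txt declaration that targets a specific crawler by its user agent looks like this:

```
User-agent: Googlebot
Disallow: /private/
```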
During scraping, you can embed your own contact information within the User-Agent header of your HTTP requests. To be polite, you can use a value such as 'MyCompany-MyCrawler ([email protected])'. If the remote server flags you as being in violation, it will certainly be capturing this header, and providing it in this form gives the site owner a convenient means of contacting you instead of simply shutting you down.
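A minimal sketch of the idea using the requests library (the crawler name and email address here are placeholders, not a real contact point):

```python
import requests

# Identify the crawler and give site owners a way to reach us.
# The name and address are placeholders -- substitute your own.
HEADERS = {
    "User-Agent": "MyCompany-MyCrawler ([email protected])"
}

response = requests.get("https://example.com", headers=HEADERS)

# Confirm the header that was actually sent with the request
print(response.request.headers["User-Agent"])
```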
How to do it
Setting the user agent differs depending on which tools you use.
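Whatever the tool, the underlying mechanism is the same: set the User-Agent header on every outgoing request. As a minimal sketch using only Python's standard library (with the same placeholder identity string as above):

```python
from urllib.request import Request, urlopen

# urllib sends "Python-urllib/x.y" by default, which tells site
# owners nothing about who you are or how to contact you.
req = Request(
    "https://example.com",
    headers={"User-Agent": "MyCompany-MyCrawler ([email protected])"},
)

with urlopen(req) as resp:
    print(resp.status)
```

In Scrapy, the equivalent is setting USER_AGENT in your project's settings.py.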