Web spiders with Scrapy
Web spidering starts with a URL or a list of URLs to visit. When the spider fetches a page, it parses the page to identify all of its hyperlinks and adds those links to the list of URLs to be crawled. This process continues recursively for as long as new links are found.
A web spider can find new URLs and index them for crawling or download useful data from them. In the following recipe, we will use Scrapy to create a web spider.
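To make the crawl loop concrete, here is a minimal sketch of such a spider in Scrapy; the spider name and the start URL (http://books.toscrape.com, a public scraping sandbox) are illustrative assumptions, not part of the recipe:

import scrapy

class LinkSpider(scrapy.Spider):
    # 'name' and 'start_urls' are illustrative placeholders
    name = "link-spider"
    start_urls = ["http://books.toscrape.com"]

    def parse(self, response):
        # Extract every hyperlink on the fetched page...
        for href in response.css("a::attr(href)").getall():
            # ...and schedule it for crawling. Scrapy deduplicates
            # URLs it has already seen, so the crawl terminates
            # once no new links turn up.
            yield response.follow(href, callback=self.parse)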
Getting ready
We can start by installing Scrapy. It can be installed using Python's pip command:
pip install scrapy
Make sure that you have the required permissions to install Scrapy. If you run into permission errors, use the sudo command.
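To confirm that the installation succeeded, you can ask Scrapy for its version; pip installs a scrapy command onto your PATH:

$ scrapy version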
How to do it...
Let's create a simple spider with Scrapy:
- To create a new spider project, open the terminal and change to the folder that will hold our spider:
$ mkdir new-spider
$ cd new-spider
- Then run the following command to create a new spider project with scrapy:
$ scrapy startproject books
This will create a books directory containing the default Scrapy project skeleton.
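The exact files vary slightly between Scrapy versions, but the generated skeleton looks roughly like this:

books/
    scrapy.cfg            # deploy configuration file
    books/                # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # folder where your spiders will live
            __init__.py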