Handling paginated websites
Pagination breaks a large set of content into a number of pages. Normally, each page has previous and next links for the user to click. These links can generally be found with XPath or other means and then followed to reach the next (or previous) page. Let's examine how to traverse pages with Scrapy. We'll look at a hypothetical example of crawling the results of an automated internet search. The techniques apply directly to many commercial sites with search capabilities and are easily modified for those situations.
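Before turning to Scrapy, the core idea of finding a next link with XPath and following it can be sketched with only the Python standard library. This is a minimal illustration, not the recipe's implementation: the example.com URLs, the in-memory PAGES dictionary, the class="next" attribute on the link, and the helper names find_next_url and crawl are all assumptions made for the sketch. It also relies on the markup being well-formed, because ElementTree is an XML parser that supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> well-formed markup.
# A real crawler would download each page instead of looking it up here.
BASE = "http://example.com/results/"
PAGES = {
    BASE + "page1.html": (
        '<html><body><p class="data">one</p>'
        '<a class="next" href="page2.html">Next</a></body></html>'
    ),
    BASE + "page2.html": (
        '<html><body><p class="data">two</p>'
        '<a class="prev" href="page1.html">Previous</a>'
        '<a class="next" href="page3.html">Next</a></body></html>'
    ),
    BASE + "page3.html": (
        '<html><body><p class="data">three</p>'
        '<a class="prev" href="page2.html">Previous</a></body></html>'
    ),
}


def find_next_url(page_url, markup):
    """Return the absolute URL of the next page, or None on the last page."""
    root = ET.fromstring(markup)
    # Assumed convention: the next link carries class="next".
    link = root.find(".//a[@class='next']")
    if link is None:
        return None
    # Resolve the relative href against the URL of the current page.
    return urljoin(page_url, link.get("href"))


def crawl(start_url):
    """Follow next links from start_url and return the URLs in visit order."""
    visited = []
    url = start_url
    # Stop when there is no next link or a page points back to one already seen.
    while url is not None and url not in visited:
        visited.append(url)
        url = find_next_url(url, PAGES[url])
    return visited
```

Calling crawl(BASE + "page1.html") walks the three hypothetical pages in order and stops on the page that has no next link. The check against already visited URLs is a simple guard against pagination loops; a Scrapy spider follows the same pattern by yielding a new request for the next URL from its parse callback.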
Getting ready
We will demonstrate handling pagination with an example that crawls a set of pages from the website in the provided container. This website models five pages with previous and next links on each page, along with some embedded data within each page that we will extract.
The first page of the set can be seen at http://localhost:5001/pagination/page1.html. The following image shows this page open, and we are inspecting the Next...