Using Scrapy selectors
Scrapy is a Python web crawling framework used to extract data from websites. It provides many powerful features for navigating entire websites, such as the ability to follow links. It also lets you find data within a document using the DOM and the now quite familiar XPath.
In this recipe we will load the list of current questions on StackOverflow and then parse it using a Scrapy selector. Using that selector, we will extract the text of each question.
Getting ready
The code for this recipe is in 02/05_scrapy_selectors.py.
How to do it...
We start by importing Selector from scrapy, and also requests so that we can retrieve the page:
In [1]: from scrapy.selector import Selector
   ...: import requests
   ...:
Next we load the page. For this example we are going to retrieve the most recent questions on StackOverflow and extract their titles. We can make this query with the following:
In [2]: response = requests.get("http://stackoverflow...