Crawling links on Wikipedia
In this recipe, we will write a small program that crawls the links on a Wikipedia page through several levels of depth. During the crawl, we will gather the relationships between each page and the pages it references. We will visualize these relationships in the next recipe.
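The idea of a depth-limited crawl that records page relationships can be sketched as follows. This is a minimal illustration, not the recipe's actual code: the `TOY_LINKS` dictionary and the `get_links` and `crawl` names are hypothetical stand-ins for fetching a real page and extracting its links.

```python
from collections import deque

# Hypothetical toy link graph standing in for real Wikipedia pages.
TOY_LINKS = {
    "Python": ["Guido_van_Rossum", "C"],
    "Guido_van_Rossum": ["Python", "Netherlands"],
    "C": ["Unix"],
    "Unix": ["C"],
}


def get_links(page):
    """Return the pages linked from `page` (assumed helper; here a lookup)."""
    return TOY_LINKS.get(page, [])


def crawl(start, max_depth, get_links=get_links):
    """Breadth-first crawl that returns (parent, child) pairs.

    Pages at depth `max_depth` are recorded as children but not expanded.
    """
    relationships = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth limit reached; do not follow this page's links
        for child in get_links(page):
            relationships.append((page, child))
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return relationships


if __name__ == "__main__":
    for parent, child in crawl("Python", max_depth=2):
        print(parent, "->", child)
```

The list of (parent, child) pairs is the kind of structure that a graph visualization can consume directly, since each pair corresponds to one edge.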
Getting ready
The code for this example is in 08/05_wikipedia_scrapy.py. It references code in a module in the modules/wikipedia folder of the code samples, so make sure that is in your Python path.
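If the folder is not already on your Python path, one way to add it at runtime is shown below. This is a sketch under assumptions: the relative location in `MODULES_DIR` is hypothetical and depends on where you placed the code samples, and whether you need to add the modules folder itself or a subfolder depends on how the script imports the module.

```python
import os
import sys

# Hypothetical location of the code samples' modules folder, relative to the
# current working directory; adjust this to match your own checkout.
MODULES_DIR = os.path.abspath(os.path.join("..", "modules"))

# Put the folder at the front of the import search path if it is missing.
if MODULES_DIR not in sys.path:
    sys.path.insert(0, MODULES_DIR)
```

Setting the `PYTHONPATH` environment variable before launching the script achieves the same result without changing any code.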
How to do it
You can run the sample Python script. It crawls a single Wikipedia page using Scrapy: the Python page at https://en.wikipedia.org/wiki/Python_(programming_language). It collects the relevant links on that page.
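To give a feel for what collecting relevant links can involve, here is a minimal sketch that uses only the standard library rather than Scrapy. The filtering rule is an assumption, not necessarily the one the sample script uses: it keeps internal /wiki/ links and skips namespaced pages such as File: or Help:. The names `WikiLinkParser`, `is_article_link`, `extract_article_links`, and `SAMPLE_HTML` are hypothetical.

```python
from html.parser import HTMLParser


def is_article_link(href):
    """Assumed rule: internal /wiki/ links without a namespace prefix."""
    if not href.startswith("/wiki/"):
        return False
    # Namespaced pages (File:, Help:, Category:, ...) contain a colon.
    return ":" not in href[len("/wiki/"):]


class WikiLinkParser(HTMLParser):
    """Collects unique article links from anchor tags, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and is_article_link(href) and href not in self.links:
            self.links.append(href)


def extract_article_links(html):
    parser = WikiLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical HTML fragment used only to exercise the filter.
SAMPLE_HTML = """
<p><a href="/wiki/Guido_van_Rossum">Guido van Rossum</a>
<a href="/wiki/File:Python_logo.svg">logo</a>
<a href="https://www.python.org/">official site</a>
<a href="#History">History</a>
<a href="/wiki/Guido_van_Rossum">again</a>
<a href="/wiki/C_(programming_language)">C</a></p>
"""

if __name__ == "__main__":
    for link in extract_article_links(SAMPLE_HTML):
        print(link)
```

In a Scrapy spider, the same filter could be applied to the hrefs returned by a selector such as `response.css('a::attr(href)').getall()`.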
When you run the script, you will see output similar to the following:
/Users/michaelheydt/anaconda/bin/python3.6 /Users/michaelheydt/Dropbox/Packt/Books/PyWebScrCookbook/code/py/08/05_wikipedia_scrapy...