Applying the Algorithm to Real Data
Let's use our Python implementation of the PageRank algorithm to some larger-scale data. We will use a dataset shared by J. Kleinberg at Cornell by crawling the web to find web pages containing the word California. It is a text file in the following form:
Type Source Destination n 0 http://www.berkeley.edu/ n 1 http://www.caltech.edu/ … n 9663 http://www.cs.ucl.ac.uk/external/P.Dourish/hotlist.html e 0 449 e 0 450 … e 9663 7907
The first part contains 9,663 web pages that have the word California, and the rest is an adjacency list for the graph representing the "internet" of these 9,663 web pages. For example, take the following line:
e 0 499
This means web page 0 has a link to web page 499. In order to implement PageRank on this dataset, we need to create an adjacency matrix.
Let's use some Python code to read this data file into a pandas DataFrame and display it:
# import the pandas library import...