Three approaches to scrape a web page
Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular BeautifulSoup module, and finally with the powerful lxml module.
Regular expressions
If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/3/howto/regex.html. Even if you use regular expressions (or regex) with another programming language, I recommend stepping through it for a refresher on regex with Python.
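As a quick refresher, here is a minimal sketch of extracting values from HTML with Python's built-in re module. The HTML fragment, the class names, and the variable names are all invented for illustration and do not come from the page discussed in this chapter.

```python
import re

# Hypothetical HTML fragment, made up for this example only.
html = '''
<table>
<tr><td class="label">Area</td><td class="value">244,820 square kilometres</td></tr>
<tr><td class="label">Population</td><td class="value">62,348,447</td></tr>
</table>
'''

# The non-greedy group (.*?) captures as little text as possible,
# so each match stops at the first closing </td> tag.
value_pattern = re.compile(r'<td class="value">(.*?)</td>')

# findall returns the captured group for every match, in document order.
values = value_pattern.findall(html)
print(values)
```

A pattern like this depends on the exact markup: if the attribute order, quoting, or whitespace in the tag changes, it will stop matching.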
Note
Because each chapter might build on or use parts of previous chapters, we recommend setting up your file structure similar to that in the book repository. All code can then be run from the code directory in the repository so imports work properly. If you would like to set up a different structure, note that you will need to change all imports from other chapters (such as the from chp1.advanced_link_crawler...