How to parse websites and navigate the DOM using Beautiful Soup
When a browser displays a web page, it builds a model of the page's content known as the Document Object Model (DOM). The DOM is a hierarchical representation of the page's entire content, along with its structural information, style information, scripts, and links to other content.
Understanding this structure is essential for scraping data from web pages effectively. We will look at an example web page and its DOM, and examine how to navigate the DOM with Beautiful Soup.
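To make the idea of a hierarchical DOM concrete, here is a minimal sketch of parsing and navigating a tree with Beautiful Soup. The HTML string is a made-up stand-in, not the sample site from the book, and the variable names are purely illustrative. It assumes the bs4 package is installed and uses Python's built-in html.parser backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration; it is not the book's sample page.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div id="content">
      <h1>Heading</h1>
      <p class="intro">First paragraph with a <a href="/other.html">link</a>.</p>
      <p>Second paragraph.</p>
    </div>
  </body>
</html>
"""

# Build the parse tree with the parser that ships with Python.
soup = BeautifulSoup(html, "html.parser")

# Moving down: attribute access returns the first matching tag.
title_text = soup.title.string

# Direct children of the div that are tags (whitespace text nodes are skipped).
content = soup.find("div", id="content")
child_tags = [child.name for child in content.find_all(recursive=False)]

# Moving up: walk from the link through its ancestors
# (p, div, body, html, and finally the document object itself).
link = soup.find("a")
link_href = link["href"]
link_ancestors = [parent.name for parent in link.parents]

# Moving sideways: the next <p> that shares a parent with the intro paragraph.
next_paragraph = soup.find("p", class_="intro").find_next_sibling("p")

print(title_text)
print(child_tags)
print(link_href, link_ancestors)
print(next_paragraph.get_text())
```

The three directions shown here (down to children, up to parents, across to siblings) mirror the parent/child structure that the browser's DOM inspector displays.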
Getting ready
We will use a small website that is included in the www folder of the sample code. To follow along, start a web server from within the www folder. This can be done with Python 3 as follows:
www $ python3 -m http.server 8080
Serving HTTP on 0.0.0.0 port 8080 (http://0.0.0.0:8080/) ...
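Before moving on, it can be useful to confirm from Python that the server is reachable. The following is a small sketch using only the standard library. The helper name fetch_page and its default URL are assumptions for illustration. The default requests the root path, which http.server answers with index.html if one exists and with a directory listing otherwise.

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch_page(url="http://localhost:8080/", timeout=5):
    """Return the decoded body of `url`, or None if the request fails.

    Hypothetical helper; the default URL assumes the local server from
    the command above is listening on port 8080.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset)
    except URLError as exc:
        # Covers refused connections as well as HTTP error statuses.
        print(f"Could not reach {url}: {exc.reason}")
        return None
```

Calling fetch_page() with no arguments while the server is running should return the HTML as a string, which can then be handed to Beautiful Soup for parsing.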
The DOM of a web page can be examined in Chrome by right-clicking the page and selecting Inspect. This opens the Chrome Developer Tools...