
Scraping a Web Page

2017-06-20

In this article by Katharine Jarmul, author of the book Python Web Scraping - Second Edition, we look at an example. Suppose I have a shop selling shoes and want to keep track of my competitor's prices. I could go to my competitor's website each day and compare each shoe's price with my own; however, this would take a lot of time and would not scale well if I sell thousands of shoes or need to check price changes frequently. Or maybe I just want to buy a shoe when it's on sale. I could come back and check the shoe website each day until I get lucky, but the shoe I want might not be on sale for months. These repetitive manual processes could instead be replaced with an automated solution using the web scraping techniques covered in this book.

In an ideal world, web scraping wouldn't be necessary and each website would provide an API to share the data in a structured format. Indeed, some websites do provide APIs, but they typically restrict the data that is available and how frequently it can be accessed. Additionally, a website developer might change, remove or restrict the backend API. In short, we cannot rely on APIs to access the online data we may want and therefore, we need to learn about web scraping techniques.

Three approaches to scrape a web page

Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular Beautiful Soup module, and finally with the powerful lxml module.
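
The examples that follow fetch pages with a download function imported from the book's advanced_link_crawler module, which is not reproduced in this article. If you do not have that code available, a minimal stand-in could look like the sketch below; it only assumes that download(url) should return the page's HTML as a string, and it omits the retry and error handling of the book's version.

import urllib.request

def download(url, user_agent='wswp'):
    # Minimal stand-in: fetch the URL with a custom User-agent header
    # and return the response body decoded as text (UTF-8 assumed).
    request = urllib.request.Request(url, headers={'User-agent': user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8')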

Regular expressions

If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/3/howto/regex.html. Even if you use regular expressions (or regex) with another programming language, I recommend stepping through it for a refresher on regex with Python.
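
As a very quick illustration of the only call used below, re.findall returns the text captured by the group for every non-overlapping match of the pattern:

>>> import re
>>> # findall returns the captured group for each match, in document order
>>> re.findall(r'<b>(.*?)</b>', '<b>foo</b> and <b>bar</b>')
['foo', 'bar']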

To scrape the country area using regular expressions, we will first try matching the contents of the <td> element, as follows:

>>> import re 
>>> from advanced_link_crawler import download 
>>> url = 'http://example.webscraping.com/view/UnitedKingdom-239' 
>>> html = download(url) 
>>> re.findall(r'<td class="w2p_fw">(.*?)</td>', html) 
['<img src="/places/static/images/flags/gb.png" />', 
  '244,820 square kilometres', 
  '62,348,447', 
  'GB', 
  'United Kingdom', 
  'London', 
  'EU', 
  '.uk', 
  'GBP', 
  'Pound', 
  '44', 
  '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', 
  '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2}[A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2})|([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$', 
  'en-GB,cy-GB,gd', 
  'IE ']

This result shows that the <td class="w2p_fw"> tag is used for multiple country attributes. If we simply want to scrape the country area, we can select the second matching element, as follows:

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
'244,820 square kilometres'

This solution works but could easily fail if the web page is updated. Consider if this table is changed and the area is no longer in the second matching element. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data at some point, we want our solution to be as robust against layout changes as possible. To make this regular expression more specific, we can include the parent <tr> element, which has an ID, so it ought to be unique:

>>> re.findall('<tr id="places_area__row"><td class="w2p_fl"><label for="places_area" id="places_area__label">Area: </label></td><td class="w2p_fw">(.*?)</td>', html)
['244,820 square kilometres']

This iteration is better; however, there are many other ways the web page could be updated that would still break the regular expression. For example, double quotation marks might be changed to single, extra spaces could be added between the <td> tags, or the area_label could be changed. Here is an improved version that tries to support these various possibilities:

>>> re.findall(r'''<tr id="places_area__row">.*?<td\s*class=["']w2p_fw["']>(.*?)</td>''', html)
['244,820 square kilometres']

This regular expression is more future-proof but is difficult to construct, and quite unreadable. Also, there are still plenty of other minor layout changes that would break it, such as if a title attribute was added to the <td> tag or if the tr or td elements changed their CSS classes or IDs.

From this example, it is clear that regular expressions provide a quick way to scrape data but are too brittle and easily break when a web page is updated. Fortunately, there are better data extraction solutions, such as Beautiful Soup and lxml, which we will cover next.

Beautiful Soup

Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate content. If you do not already have this module, the latest version can be installed using this command:

pip install beautifulsoup4

The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Many web pages do not contain perfectly valid HTML and Beautiful Soup needs to correct improper open and close tags. For example, consider this simple web page containing a list with missing attribute quotes and closing tags:

<ul class=country> 
             <li>Area 
             <li>Population 
         </ul>

If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping. Let us see how Beautiful Soup handles this:

>>> from bs4 import BeautifulSoup 
 >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' 
 >>> # parse the HTML 
 >>> soup = BeautifulSoup(broken_html, 'html.parser') 
 >>> fixed_html = soup.prettify() 
 >>> print(fixed_html)
  
 <ul class="country">
  <li>
   Area
   <li>
    Population
   </li>
  </li>
 </ul>

We can see that using the default html.parser did not result in properly parsed HTML: it has nested the li elements, which might make them difficult to navigate. Luckily, there are more options for parsers: we can install lxml (covered in the next section) or use html5lib. To install html5lib, simply use pip:

pip install html5lib

Now, we can repeat this code, changing only the parser like so:

>>> soup = BeautifulSoup(broken_html, 'html5lib') 
 >>> fixed_html = soup.prettify() 
 >>> print(fixed_html)
 <html>
    <head>
    </head>
    <body>
      <ul class="country">
        <li>
          Area
        </li>
        <li>
          Population
        </li>
      </ul>
    </body>
 </html>

Here, Beautiful Soup using html5lib was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. You should see similar results if you use lxml as the parser. Now, we can navigate to the elements we want using the find() and find_all() methods:

>>> ul = soup.find('ul', attrs={'class':'country'}) 
 >>> ul.find('li')  # returns just the first match 
 <li>Area</li> 
 >>> ul.find_all('li')  # returns all matches 
 [<li>Area</li>, <li>Population</li>]

For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.
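
Beautiful Soup also accepts CSS selectors through the select() and select_one() methods (backed by the soupsieve package installed alongside recent versions of beautifulsoup4), which can be a compact alternative to find() and find_all(). A quick sketch using the soup object from above:

>>> soup.select('ul.country > li')   # all list items inside the country list
[<li>Area</li>, <li>Population</li>]
>>> soup.select_one('ul.country > li')   # just the first match
<li>Area</li>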

Now, using these techniques, here is a full example to extract the country area from our example website:

>>> from bs4 import BeautifulSoup 
 >>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239' 
 >>> html = download(url) 
 >>> soup = BeautifulSoup(html, 'html5lib')
 >>> # locate the area row 
 >>> tr = soup.find(attrs={'id':'places_area__row'}) 
 >>> td = tr.find(attrs={'class':'w2p_fw'})  # locate the data element
 >>> area = td.text  # extract the text from the data element
 >>> print(area) 
 244,820 square kilometres

This code is more verbose than the regular expression approach but easier to construct and understand. Also, we no longer need to worry about minor layout changes, such as extra whitespace or tag attributes. We also know that if the page contains broken HTML, Beautiful Soup can help clean it and allow us to extract data from very broken website code.

Lxml

Lxml is a Python library built on top of the libxml2 XML parsing library written in C, which helps make it faster than Beautiful Soup but also harder to install on some computers, specifically Windows. The latest installation instructions are available at http://lxml.de/installation.html. If you run into difficulties installing the library on your own, you can also use Anaconda to do so:  https://anaconda.org/anaconda/lxml. If you are unfamiliar with Anaconda, it is a package and environment manager primarily focused on open data science packages built by the folks at Continuum Analytics. You can download and install Anaconda by following their setup instructions here: https://www.continuum.io/downloads. Note that using the Anaconda quick install will set your PYTHON_PATH to the Conda installation of Python.
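
For example, once Anaconda is installed, the library can typically be added from the channel linked above with a command along these lines:

conda install -c anaconda lxml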

As with Beautiful Soup, the first step when using lxml is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML:

>>> from lxml.html import fromstring, tostring
 >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' 
 >>> tree = fromstring(broken_html)  # parse the HTML  
 >>> fixed_html = tostring(tree, pretty_print=True).decode() 
 >>> print(fixed_html) 
 <ul class="country"> 
     <li>Area</li> 
     <li>Population</li> 
 </ul>

As with Beautiful Soup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags. These are not requirements for standard XML, so they are unnecessary for lxml to insert.
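
If you do want a complete document, lxml can also parse the input as a full page. Here is a small sketch using lxml.html.document_fromstring, which wraps the fragment in <html> and <body> elements, reusing the broken_html string defined above:

>>> from lxml.html import document_fromstring
>>> doc = document_fromstring(broken_html)  # parsed as a complete document
>>> doc.tag   # the root element is now <html>
'html'
>>> [li.text for li in doc.iter('li')]   # the list items are siblings, as before
['Area', 'Population']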

After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup. Instead, we will use CSS selectors here, because they are more compact and can be reused later when parsing dynamic content. Some readers will already be familiar with them from their experience with jQuery selectors or use in front-end web application development. We will compare performance of these selectors with XPath. To use CSS selectors, you might need to install the cssselect library like so:

pip install cssselect

Now we can use the lxml CSS selectors to extract the area data from the example page:

>>> tree = fromstring(html) 
 >>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0] 
 >>> area = td.text_content() 
 >>> print(area) 
 244,820 square kilometres

By using the cssselect method on our tree, we can utilize CSS syntax to select a table row element with the places_area__row ID, and then the child table data tag with the w2p_fw class. Since cssselect returns a list, we then index the first result and call the text_content method, which will iterate over all child elements and return concatenated text of each element. In this case, we only have one element, but this functionality is useful to know for more complex extraction examples.
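
For comparison, the same element can be selected with lxml's XPath support; cssselect in fact works by translating CSS expressions into XPath. A roughly equivalent sketch, assuming w2p_fw is the cell's only class (this XPath tests the attribute's exact value):

>>> td = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
>>> td.text_content()
'244,820 square kilometres'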

Summary

We have walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and Beautiful Soup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, so we will use it in future examples.
