Parsing HTML tables
After downloading the HTML pages from the server, we have to extract the required data from them. Python has many modules that help with this. Here we use the BeautifulSoup package.
Getting ready
As usual, make sure that you install all the required packages. For this script, we require BeautifulSoup and pandas. You can install them with pip:
pip install bs4
pip install pandas
pandas is an open source data analysis library in Python.
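To confirm that both packages are installed and work together, a minimal sketch like the following can help. It parses a small inline HTML table rather than a downloaded page. The sample markup and all variable names here are assumptions made for illustration and are not part of the recipe's script.

```python
# Sanity check: parse a tiny inline HTML table with BeautifulSoup and pandas.
# The sample markup and all variable names are illustrative assumptions.
import pandas as pd
from bs4 import BeautifulSoup

sample_html = """
<table>
  <tr><th>Name</th><th>Country</th></tr>
  <tr><td>Alice</td><td>Germany</td></tr>
  <tr><td>Bob</td><td>Mexico</td></tr>
</table>
"""

# html.parser ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(sample_html, "html.parser")
rows = soup.find("table").find_all("tr")

# The first row holds the header cells; the remaining rows hold data cells.
header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
records = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in rows[1:]
]

# Build a DataFrame so the parsed rows can be inspected or processed further.
df = pd.DataFrame(records, columns=header)
print(df)
```

If this prints a two-row table with the columns Name and Country, the environment is ready for the steps that follow.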
How to do it...
We can parse HTML tables from the downloaded pages as follows:
- As usual, we have to import the required modules for the script. Here, we import BeautifulSoup for parsing HTML and pandas for handling the parsed data. We also import the urllib2 module for getting the web page from the server:
import urllib2
import pandas as pd
from bs4 import BeautifulSoup
- Now we can get the HTML page from the server; for this, we can use the urllib2 module:
url = "https://www.w3schools.com/html/html_tables...