Extracting data from HTML documents
We can export the parsed data to a .csv or Excel file with the help of the pandas library.
Getting ready
To use the pandas functions that export the parsed data to Excel, we need an additional dependency, the openpyxl module, so make sure you install openpyxl with pip:
pip install openpyxl
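Before moving on, it can help to confirm that openpyxl is importable from the same interpreter that will run pandas. The following is a minimal sketch that uses only the standard library; the helper name excel_engine_available is made up for illustration and is not part of pandas or openpyxl:

```python
import importlib.util


def excel_engine_available(module_name="openpyxl"):
    """Return True if the named module can be found by this interpreter."""
    # find_spec() returns None when the top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("openpyxl installed:", excel_engine_available())
```

If this reports False, the pip command above probably installed the package into a different Python environment from the one running the script.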
How to do it...
We can export the data from HTML to .csv or Excel documents as follows:
- To create a .csv file, we can use the to_csv() method in pandas. We can rewrite the previous recipe as follows:
    import urllib.request
    import pandas as pd
    from bs4 import BeautifulSoup

    url = "https://www.w3schools.com/html/html_tables.asp"
    try:
        page = urllib.request.urlopen(url)
    except Exception as e:
        print(e)
        pass

    soup = BeautifulSoup(page, "html.parser")
    table = soup.find_all('table')[0]
    new_table = pd.DataFrame(columns=['Company', 'Contact', 'Country'], index=range(0, 7))

    row_number = 0
    for row in table.find_all('tr'):
        column_number...
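The same idea can be tried without a network connection. The sketch below is not the recipe's own code: it parses a small inline HTML table (the rows are made-up stand-in data), builds a DataFrame from the header and data cells, and serializes it with to_csv(). The names SAMPLE_HTML and table_to_dataframe are hypothetical and exist only for this illustration:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up stand-in for a downloaded page; not the real w3schools content.
SAMPLE_HTML = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Acme Ltd</td><td>Jane Doe</td><td>Norway</td></tr>
  <tr><td>Globex</td><td>John Roe</td><td>Chile</td></tr>
</table>
"""


def table_to_dataframe(html):
    """Convert the first <table> in an HTML string into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    # Assumes the first row holds the column names in <th> cells.
    header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    # Every later row contributes one record built from its <td> cells.
    records = [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in rows[1:]
    ]
    # Skip rows that contain no <td> cells at all.
    records = [record for record in records if record]
    return pd.DataFrame(records, columns=header)


df = table_to_dataframe(SAMPLE_HTML)

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write files instead, df.to_csv("table.csv", index=False) saves the CSV to disk and df.to_excel("table.xlsx", index=False) saves an Excel workbook; the file names here are placeholders, and the Excel call relies on openpyxl being installed.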