Extracting data from HTML documents
We can export the parsed data to .csv or Excel format with the help of the pandas library.
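As a minimal sketch of the export step on its own, the example below writes a small DataFrame to CSV text. The rows are hypothetical placeholders standing in for data parsed from an HTML table, and the names rows, frame, and buffer are chosen only for illustration. It writes to an in-memory buffer so that nothing is created on disk.

```python
import io

import pandas as pd

# Hypothetical rows standing in for data parsed from an HTML table.
rows = [
    {"Company": "Example Co", "Contact": "A. Person", "Country": "Germany"},
    {"Company": "Sample Ltd", "Contact": "B. Person", "Country": "Mexico"},
]
frame = pd.DataFrame(rows, columns=["Company", "Contact", "Country"])

# Write to an in-memory buffer; pass a file path instead to write to disk.
# index=False leaves the DataFrame's row index out of the output.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a path such as "companies.csv" (an assumed file name) to to_csv() writes the same content to a file. For Excel output, frame.to_excel("companies.xlsx", index=False) should work the same way once openpyxl is installed.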
Getting ready
To export the parsed data to Excel with pandas, we need one more dependency, the openpyxl module. Install it with pip:
pip install openpyxl
How to do it...
We can extract the data from HTML to .csv or Excel documents as follows:
- To create a .csv file, we can use the to_csv() method in pandas. We can rewrite the previous recipe as follows:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.w3schools.com/html/html_tables.asp"
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    # Without a page there is nothing to parse, so report the error and stop.
    print(e)
    raise SystemExit(1)
soup = BeautifulSoup(page, "html.parser")
# Take the first <table> element found on the page.
table = soup.find_all('table')[0]
# Pre-allocate an empty frame with the expected columns and seven rows.
new_table = pd.DataFrame(columns=['Company', 'Contact', 'Country'], index=range(0, 7))
row_number = 0
for row in table.find_all('tr'):
    column_number...