Extracting text data from an HTML page using the XML library
The XML library is another R package for extracting text data from an HTML web page. In the previous recipes, you saw how to extract text from a web page using readLines() and then using the rvest library. In this recipe, you will go through the functions of the XML library to extract the same data.
Getting ready
Before using the XML library, you have to install it on your computer. To install the XML library, you can use the following code:
install.packages("XML", dependencies = TRUE)
Once the installation is completed, you are ready to implement this recipe.
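If the script may run on a machine where the package is already present, a guarded install avoids downloading it again. The following is a minimal sketch; the requireNamespace() check is an addition for illustration and is not part of the recipe's own steps.

```r
# Install the XML package only when it is not already available.
if (!requireNamespace("XML", quietly = TRUE)) {
  install.packages("XML", dependencies = TRUE)
}

# Attach the package so its functions can be called directly.
library(XML)
```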
How to do it…
An HTML file has a tree-like structure. It represents the data using various internal nodes. Each node is represented by a tag pair, such as <p>…</p>. The steps are as follows:
- Create an R object containing the character string of the website address.
- Load the XML library into your R session.
- Pass the link to the htmlTreeParse() function, and make sure...
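To see the tree structure in action without depending on a live website, the sketch below parses a small inline HTML string instead of a web address. The html_text value is a hypothetical stand-in for a downloaded page, and the asText and useInternalNodes arguments are choices made for this illustration; they are not necessarily the settings the recipe goes on to use.

```r
library(XML)

# Hypothetical inline HTML standing in for a downloaded page.
html_text <- paste0(
  "<html><body><h1>Title</h1>",
  "<p>First paragraph.</p><p>Second paragraph.</p>",
  "</body></html>"
)

# asText = TRUE parses the string itself rather than treating it as a file or URL.
# useInternalNodes = TRUE returns a document that the XPath functions can query.
doc <- htmlTreeParse(html_text, asText = TRUE, useInternalNodes = TRUE)

# Collect the text inside every <p>…</p> node of the tree.
paragraphs <- xpathSApply(doc, "//p", xmlValue)
print(paragraphs)
```

Each <p>…</p> pair becomes one node in the parsed tree, so the XPath expression //p selects every paragraph node and xmlValue() returns the text it contains.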