Extracting text data from an HTML page
The Extracting unstructured text data from a plain web page recipe in this chapter showed how to read the HTML source code of a page into a character vector. With that approach, further processing is not straightforward because the output object mixes plain text with HTML tags, and stripping those tags out by hand is time-consuming.
In this recipe, you will read the same web page from the following link:
https://en.wikipedia.org/wiki/Programming_with_Big_Data_in_R
However, this time you will use a different strategy: parsing the page so that you can work with its HTML tags directly.
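As a rough sketch of what working with HTML tags can look like, the snippet below parses markup into a document tree and pulls the text out of selected elements. It assumes the rvest package is already installed (see Getting ready) and that its version is 1.0.0 or later. The inline HTML string and the variable names are hypothetical stand-ins for the real page, used so the example runs without a network connection. This is an illustration of the general idea, not necessarily the exact steps the recipe follows.

```r
library(rvest)

# Hypothetical inline HTML standing in for the downloaded page source
page_source <- paste0(
  "<html><body>",
  "<h1>Sample title</h1>",
  "<p>First <b>paragraph</b>.</p>",
  "<p>Second paragraph.</p>",
  "</body></html>"
)

# Parse the markup into a document tree rather than a character vector
page <- read_html(page_source)

# Select elements by tag name and keep only their text content
# (older rvest versions use html_node()/html_nodes() and html_text() instead)
heading <- html_text2(html_element(page, "h1"))
paragraphs <- html_text2(html_elements(page, "p"))

print(heading)
print(paragraphs)
```

Because the tags are used to select content and are then dropped by the text extraction step, no manual cleanup of the markup is needed. Passing the Wikipedia URL to read_html() in place of the inline string would apply the same approach to the live page.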
Getting ready
To implement this recipe, you need an additional package, rvest. If this package has not been installed on your computer, now is the time to install it along with its dependencies. Here is the code to install the rvest package:
install.packages("rvest", dependencies = TRUE)
Once the installation has been completed, you are...