Reading XML data
You may sometimes need to extract data from websites. Many providers also supply data in XML and JSON formats. In this recipe, we learn about reading XML data.
Getting ready
Make sure you have downloaded the files for this chapter and that the files cd_catalog.xml and WorldPopulation-wiki.htm are in the working directory of R. If the XML package is not already installed in your R environment, install the package now, as follows:
> install.packages("XML")
How to do it...
XML data can be read by following these steps:
- Load the library and initialize:
> library(XML)
> url <- "cd_catalog.xml"
- Parse the XML file and get the root node:
> xmldoc <- xmlParse(url)
> rootNode <- xmlRoot(xmldoc)
> rootNode[1]
- Extract the XML data:
> data <- xmlSApply(rootNode,function(x) xmlSApply(x, xmlValue))
- Convert the extracted data into a data frame:
> cd.catalog <- data.frame(t(data),row.names=NULL)
- Verify the results:
> cd.catalog[1:2,]
How it works...
The xmlParse() function returns an object of the XMLInternalDocument class, which is a C-level internal data structure.
The xmlRoot() function gets access to the root node and its elements. Let us check the first element of the root node:
> rootNode[1]
$CD
<CD>
  <TITLE>Empire Burlesque</TITLE>
  <ARTIST>Bob Dylan</ARTIST>
  <COUNTRY>USA</COUNTRY>
  <COMPANY>Columbia</COMPANY>
  <PRICE>10.90</PRICE>
  <YEAR>1985</YEAR>
</CD>

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
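As a minimal sketch of how the parsed document and its root node can be inspected, the following uses a two-record catalog inlined as a string, so it runs without cd_catalog.xml. The variable names (xml_text, doc, root) and the shortened records are assumptions made for illustration only.

```r
library(XML)

# Assumed stand-in for cd_catalog.xml: two records with a subset of the fields
xml_text <- paste0(
  "<CATALOG>",
  "<CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST><YEAR>1985</YEAR></CD>",
  "<CD><TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST><YEAR>1988</YEAR></CD>",
  "</CATALOG>"
)

# asText = TRUE tells xmlParse that the argument is XML content, not a file name
doc  <- xmlParse(xml_text, asText = TRUE)
root <- xmlRoot(doc)

class(doc)      # includes "XMLInternalDocument"
xmlName(root)   # name of the root element
xmlSize(root)   # number of child nodes directly under the root
root[[1]]       # the first <CD> node itself; root[1] gives a one-element node list
```

Double brackets return a single node, whereas single brackets return a node list, which is why the rootNode[1] output above carries the XMLInternalNodeList class.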
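To extract data from the root node, we apply the xmlSApply() function at two levels: the outer call iterates over all the children of the root node (the CD records), and the inner call applies xmlValue to every child of a record. Because each record has the same fields, the result simplifies to a character matrix with one column per record.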
To convert the preceding matrix into a data frame, we transpose the matrix using the t() function so that each record becomes a row. We then display the first two rows of the cd.catalog data frame:
> cd.catalog[1:2,]
             TITLE       ARTIST COUNTRY     COMPANY PRICE YEAR
1 Empire Burlesque    Bob Dylan     USA    Columbia 10.90 1985
2  Hide your heart Bonnie Tyler      UK CBS Records  9.90 1988
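The extraction and transposition can be sketched end to end on a small inline document. This is an illustration under assumptions, not the recipe's own data: the two records are inlined as a string with only four fields, and the final numeric conversion is an optional extra step, since the values extracted by xmlValue are all text.

```r
library(XML)

# Assumed stand-in for cd_catalog.xml with four fields per record
xml_text <- paste0(
  "<CATALOG>",
  "<CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST>",
  "<PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>",
  "<CD><TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST>",
  "<PRICE>9.90</PRICE><YEAR>1988</YEAR></CD>",
  "</CATALOG>"
)
root <- xmlRoot(xmlParse(xml_text, asText = TRUE))

# Inner call: text of each field of one record; outer call: one column per record
m <- xmlSApply(root, function(cd) xmlSApply(cd, xmlValue))
dim(m)   # fields by records

# Transpose so that records become rows, then build the data frame
cd <- data.frame(t(m), row.names = NULL, stringsAsFactors = FALSE)

# Optional: every extracted value is text, so convert numeric columns explicitly
cd$PRICE <- as.numeric(cd$PRICE)
cd$YEAR  <- as.integer(cd$YEAR)
str(cd)
```

Passing stringsAsFactors = FALSE keeps the text columns as character vectors on older versions of R, where data.frame() would otherwise create factors.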
There's more...
XML data can be deeply nested and hence can become complex to extract. Knowledge of XPath is helpful to access specific XML tags. The XML package provides several functions, such as xpathSApply() and getNodeSet(), to locate specific elements.
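As a hedged illustration of these two functions, the sketch below runs a few XPath queries against an inline two-record catalog. The XPath expressions and variable names are assumptions chosen for this example; they are not part of the recipe.

```r
library(XML)

# Assumed stand-in for cd_catalog.xml
xml_text <- paste0(
  "<CATALOG>",
  "<CD><TITLE>Empire Burlesque</TITLE><COUNTRY>USA</COUNTRY><YEAR>1985</YEAR></CD>",
  "<CD><TITLE>Hide your heart</TITLE><COUNTRY>UK</COUNTRY><YEAR>1988</YEAR></CD>",
  "</CATALOG>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# xpathSApply applies a function (here xmlValue) to every matching node
titles <- xpathSApply(doc, "//CD/TITLE", xmlValue)

# A predicate in square brackets filters records by the value of a child element
uk_titles <- xpathSApply(doc, "//CD[COUNTRY='UK']/TITLE", xmlValue)

# getNodeSet returns the matching nodes themselves rather than their values
old_cds <- getNodeSet(doc, "//CD[YEAR < 1988]")
length(old_cds)
```

Use xpathSApply() when you want values in a vector, and getNodeSet() when you need the nodes for further processing.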
Extracting HTML table data from a web page
Though it is possible to treat HTML data as a specialized form of XML, the XML package provides specific functions to extract data from HTML tables, as follows:
> url <- "WorldPopulation-wiki.htm"
> tables <- readHTMLTable(url)
> world.pop <- tables[[5]]
The readHTMLTable() function parses the web page and returns a list of all the tables that are found on the page. For tables that have an id attribute, the function uses the id attribute as the name of that list element.
We are interested in extracting the "10 most populous countries", which is the fifth table, so we use tables[[5]].
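The naming behavior can be checked on a small page built in memory. The sketch below is an assumption-laden illustration: the HTML, the id value top10, and the placeholder cell values are all made up for the example and are not taken from WorldPopulation-wiki.htm.

```r
library(XML)

# Hypothetical page with two tables; only the second has an id attribute
html_text <- paste0(
  "<html><body>",
  "<table><thead><tr><th>Note</th></tr></thead>",
  "<tbody><tr><td>layout table</td></tr></tbody></table>",
  "<table id='top10'>",
  "<thead><tr><th>Country</th><th>Population</th></tr></thead>",
  "<tbody><tr><td>A</td><td>100</td></tr><tr><td>B</td><td>50</td></tr></tbody>",
  "</table>",
  "</body></html>"
)

# Parse the string as HTML, then extract every table on the page
doc    <- htmlParse(html_text, asText = TRUE)
tables <- readHTMLTable(doc)

length(tables)   # one list element per table found
names(tables)    # the id attribute is used as the name where present

# A table with an id can be selected by name instead of by position
top10 <- tables[["top10"]]
top10
```

Selecting by name is less fragile than selecting by position, because the position of a table changes whenever the page gains or loses a table above it.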
Extracting a single HTML table from a web page
A single table can be extracted using the following command:
> table <- readHTMLTable(url,which=5)
Specify which to get data from a specific table. In this case, R returns a single data frame instead of a list of tables.
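A minimal sketch of the which argument follows, again on a hypothetical in-memory page rather than the recipe's file. The HTML and its placeholder values are assumptions for illustration.

```r
library(XML)

# Hypothetical page with two tables
html_text <- paste0(
  "<html><body>",
  "<table><thead><tr><th>Note</th></tr></thead>",
  "<tbody><tr><td>layout table</td></tr></tbody></table>",
  "<table><thead><tr><th>Country</th><th>Population</th></tr></thead>",
  "<tbody><tr><td>A</td><td>100</td></tr><tr><td>B</td><td>50</td></tr></tbody>",
  "</table>",
  "</body></html>"
)
doc <- htmlParse(html_text, asText = TRUE)

# which = 2 asks for the second table only
second <- readHTMLTable(doc, which = 2)
class(second)

# Without which, the same table is one element of the returned list
all_tables <- readHTMLTable(doc)
class(all_tables)
```

Using which avoids building data frames for tables you do not need, which can matter on pages with many tables.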