Dimensionality Reduction with Principal Component Analysis

In this article by Eric Mayor, author of the book Learning Predictive Analytics with R, we will discuss how to use PCA in R. Nowadays, access to data is easier and cheaper than ever before, which leads to a proliferation of data in organizations' data warehouses and on the Internet. Analyzing this data is not trivial, as its sheer quantity often makes the analysis difficult or impractical. For instance, the data frequently exceeds the memory available on the machines, and the available computational power is often not enough to analyze it in a reasonable time frame. One solution is to turn to technologies that deal with high dimensionality in data (big data). These solutions typically use the memory and computing power of several machines (computer clusters) for the analysis. However, most organizations do not have such an infrastructure. A more practical solution is therefore to reduce the dimensionality of the data while keeping the essential information intact.


Another reason to reduce dimensionality is that, in some cases, there are more attributes than observations. If scientists were to store the genome of every inhabitant of Europe and the United States, the number of cases (approximately 1 billion) would still be far smaller than the number of attributes: the roughly 3 billion base pairs of human DNA.

Most analyses do not work well, or do not work at all, when there are fewer observations than attributes. Confronted with this problem, data analysts might select groups of attributes that go together (such as height and weight) according to their domain knowledge, and thereby reduce the dimensionality of the dataset.

The data is often structured along several relatively independent dimensions, with each dimension measured by several attributes. This is where principal component analysis (PCA) is an essential tool, as it gives each observation a score on each of the dimensions (determined by PCA itself) while allowing the attributes from which the dimensions are computed to be discarded. In other words, for each of the obtained dimensions, PCA produces values (scores) that combine several attributes, and these scores can be used for further analysis. In the next section, we will use PCA to combine several attributes of questionnaire data. We will see that participants' self-reports on items such as lively, excited, and enthusiastic (and many more) can be combined into a single dimension we call positive arousal. In this sense, PCA performs both dimensionality reduction (discarding the attributes) and feature extraction (computing the dimensions). The extracted features can then be used for classification and regression problems.
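
To make this idea concrete before turning to the questionnaire data, here is a minimal sketch using base R's prcomp() on the built-in mtcars dataset (a hypothetical illustration, not the example analyzed below): the component scores can stand in for the original attributes in later analyses.

# Minimal sketch: PCA on a small numeric dataset (11 attributes, 32 observations)
pca_toy = prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of variance captured by each principal component
summary(pca_toy)

# Scores: one value per observation on each component; the first few columns
# can replace the original 11 attributes in further analyses
head(pca_toy$x[, 1:3])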

Another use of PCA is to check that the underlying structure of the data corresponds to a theoretical model. For instance, in a questionnaire, one group of questions might assess construct A, another group construct B, and so on. PCA will find two distinct components if the answers within each group of questions are indeed more similar to each other than to the answers in the rest of the questionnaire. Researchers in fields such as psychology use PCA mainly to test their theoretical models in this fashion.

This article is a shortened version of the chapter with the same name in my book Learning Predictive Analytics with R. In what follows, we will examine how to use PCA in R, and notably how to interpret the results and perform diagnostics.

Learning PCA in R

In this section, we will learn how to use PCA in order to obtain knowledge from data and, ultimately, reduce the number of attributes (also called features). The first dataset we will use is the msq dataset from the psych package. The msq (motivational state questionnaire) dataset is composed of 92 attributes, of which 72 are ratings of adjectives by 3896 participants describing their mood. We will only use these 72 attributes for the current purpose, which is the exploration of the structure of the questionnaire. We will therefore start by installing and loading the package and the data, and assign the data we are interested in (the 72 attributes mentioned) to an object called motiv.

install.packages("psych")
library(psych)
data(msq)
motiv = msq[,1:72]
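
A quick check (a sketch; the expected dimensions are those described above) confirms what we kept:

dim(motiv)   # should report 3896 observations and 72 attributes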

Dealing with missing values

Missing values are a common problem in real-life datasets like the one we use here. There are several ways to deal with them, but here we will only mention omitting the cases where missing values are encountered. Let's see how many missing values (we will call them NAs) there are for each attribute in this dataset:

apply(is.na(motiv),2,sum)

[Figure: View of missing data in the dataset]

We can see that for many attributes this is unproblematic (there are only a few missing values). However, for several attributes the number of NAs is quite high (anxious: 1849, cheerful: 1850, idle: 1848, inactive: 1846, tranquil: 1843). The most probable explanation is that these items were dropped from some of the samples in which the data was collected.
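
The same information can be obtained programmatically (a sketch equivalent to reading the table above) by sorting the per-column NA counts:

# The five attributes with the most missing values
sort(colSums(is.na(motiv)), decreasing = TRUE)[1:5]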

Removing all cases with NAs from the dataset would result in an empty dataset in this particular case. Try the following in your console to verify this claim:

na.omit(motiv)

For this reason, we will simply deal with the problem by removing these attributes from the analysis. Remember that there are other ways, such as data imputation, to solve this issue while keeping the attributes. In order to do so, we first need to know which column numbers correspond to the attributes we want to suppress. An easy way to do this is simply to print the names of the columns vertically (using cbind() for something it was not exactly made for). We will omit cases with missing values on the other attributes from the analysis later.

head(cbind(names(motiv)),5)

Here, we only print the result for the first five columns of the data frame:

	[,1]
[1,]	active
[2,]	afraid
[3,]	alert
[4,]	angry
[5,]	anxious

We invite you to verify on your screen that we suppress the correct columns from the analysis, using the following vector to match the attributes:

ToSuppress = c(5, 15, 37, 38, 66)

Let's check whether it is correct:

names(motiv[ToSuppress])

The following is the output:

[1] "anxious" "cheerful" "idle"     "inactive" "tranquil"

Naming the components using the loadings

We will run the analysis using the principal() function from the psych package. This will allow us to examine the loadings in a sorted fashion and to make the components independent of one another using a rotation. We will extract five components. The psych package has already been loaded, as the data we are analyzing is included in psych. However, we need to install another package, GPArotation, which is required for the analysis we will perform next:

install.packages("GPArotation")
library(GPArotation)

We can now run the analysis. This time, we want to apply an orthogonal rotation (varimax) in order to obtain independent factorial scores. More details are provided in the paper Principal component analysis by Abdi and Williams (2010). We also want the analysis to impute missing values and to estimate the scores of each observation on each retained principal component. We use the print.psych() function to print the sorted loadings, which will make the interpretation of the principal components easier:

Pca2 = principal(motiv[ , -ToSuppress], nfactors = 5,
                 rotate = "varimax", missing = TRUE, scores = TRUE)
print.psych(Pca2, sort = TRUE)

The annotated output displayed in the following screenshot has been slightly altered in order to allow it to fit on one page.

[Figure: Results of the PCA with the principal() function using varimax rotation]

The names of the items are displayed in the first column of the first matrix of results. The item column gives their order. The component loadings for the five retained components are displayed next (RC1 to RC5). We will comment on h2 and u2 later. The loadings of the attributes on the components can be used to name the principal components. As can be seen in the preceding screenshot, the first component could be called Positive arousal, the second Negative arousal, the third Serenity, the fourth Exhaustion, and the last Fear. It is worth noting that the MSQ theoretically has four components derived from two dimensions: energy and tension. Therefore, the fifth component we found is not accounted for by the model.
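
If you prefer not to rely on the screenshot, the adjectives that drive a given component can be listed from the loadings stored in the result object (a sketch using only Pca2$loadings):

loads = unclass(Pca2$loadings)                     # plain numeric matrix of rotated loadings
head(sort(loads[, "RC1"], decreasing = TRUE), 6)   # adjectives loading highest on the first component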

The h2 value indicates the proportion of variance of the variable that is accounted for by the selected components. The u2 value indicates the proportion of variance not explained by the components. The sum of h2 and u2 is 1.

Below the first result matrix, a second matrix indicates the proportion of variance explained by the components in the whole dataset. For instance, the first component explains 21 percent of the variance in the dataset (Proportion Var), which amounts to 38 percent of the variance explained by the five components together. Overall, the five components explain 56 percent of the variance in the dataset (Cumulative Var). As a reminder, the purpose of PCA is to replace all the original attributes with the scores on these components in further analysis. This is what we will discuss next.
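
Both sets of quantities can be recomputed from the loadings themselves (a sketch, assuming the rotated loading matrix stored in Pca2$loadings):

loads = unclass(Pca2$loadings)             # 67 attributes x 5 rotated components
h2 = rowSums(loads^2)                      # communality of each attribute
u2 = 1 - h2                                # uniqueness: h2 + u2 = 1 by construction
round(colSums(loads^2) / nrow(loads), 2)   # proportion of total variance per component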

PCA scores

At this point, you might wonder where data reduction comes into play. The computation of the scores for each component is what reduces the number of attributes in the dataset, and this has already been done in the previous section.
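
Concretely, the 67 retained adjective ratings are replaced by only five columns of scores (a quick sketch):

dim(motiv[ , -ToSuppress])   # 3896 observations x 67 original attributes
dim(Pca2$scores)             # 3896 observations x 5 component scores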

Accessing the PCA scores

We will now examine these scores more thoroughly. Let's start by checking whether the scores are indeed uncorrelated:

round(cor(Pca2$scores),3)

This is the output:

	RC1	RC2	RC3	RC4	RC5
RC1	1.000	-0.001	0.000	0.000	0.001
RC2	-0.001	1.000	0.000	0.000	0.000
RC3	0.000	0.000	1.000	-0.001	0.001
RC4	0.000	0.000	-0.001	1.000	0.000
RC5	0.001	0.000	0.001	0.000	1.000

As can be seen in the previous output, the correlations between the scores are equal to or lower than 0.001 in absolute value for every pair of components. The components are basically uncorrelated, as we requested. Before appending the scores to the original data, let's check that the scores object contains the same number of rows:

nrow(Pca2$scores) == nrow(msq)

 Here's the output:

[1] TRUE

As the principal() function didn't remove any cases but imputed the data instead, we can now append the factor scores to the original dataset (a copy of it):

bound = cbind(msq, Pca2$scores)
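
The appended columns carry the names of the rotated components, so they can be inspected or used directly in later models (a quick sketch):

head(round(bound[ , c("RC1", "RC2", "RC3", "RC4", "RC5")], 2))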

PCA diagnostics

Here, we will briefly discuss two diagnostics that can be performed on the dataset that will be subjected to analysis. These diagnostics should be performed before analyzing the data with PCA, in order to ascertain that PCA is an appropriate analysis for the data at hand. The first diagnostic is Bartlett's test of sphericity. This test examines the relationships between all the variables at once, instead of two by two as in a correlation. More specifically, it tests whether the correlation matrix is different from an identity matrix (which has ones on the diagonal and zeroes elsewhere). The null hypothesis is that the variables are independent (no underlying structure). The paper When is a correlation matrix appropriate for factor analysis? Some decision rules by Dziuban and Shirkey (1974) provides more information on the topic.

This test can be performed by the cortest.normal() function in the psych package. In this case, the first argument is the correlation matrix of the data we want to subject to PCA, and the n1 argument is the number of cases. We will use the same data set as in the PCA (we first assign the name M to it):

M = na.omit(motiv[ , -ToSuppress])     # drop the five high-NA attributes, then remove incomplete cases
cortest.normal(cor(M), n1 = nrow(M))

The output is provided as follows:

Tests of correlation matrices
Call:cortest.normal(R1 = cor(M), n1 = nrow(M))
Chi Square value 829268 with df = 2211   with probability < 0

The last line of the output shows that the data subjected to the analysis is clearly different from an identity matrix: the probability of obtaining these results if the correlation matrix were an identity matrix is close to zero (do not pay attention to the < 0; the value is simply extremely close to zero). You might be curious about what the output would look like for an identity matrix. It turns out we have something close to one at hand: the correlation matrix of the PCA scores with the varimax rotation that we examined before. Let's subject the scores to the analysis:

cortest.normal(cor(Pca2$scores), n1 = nrow(Pca2$scores))
Tests of correlation matrices
Call:cortest.normal(R1 = cor(Pca2$scores), n1 = nrow(Pca2$scores))
Chi Square value 0.01 with df = 10   with probability < 1

In this case, the results show that the correlation matrix is not significantly different from an identity matrix.
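
As a side note, psych also offers cortest.bartlett(), which computes the classic Bartlett sphericity chi-square directly from a correlation matrix and the sample size; the call below is a sketch of how it could be applied to the same data:

# Classic Bartlett test of sphericity: chi-square = -((n - 1) - (2p + 5)/6) * log(det(R))
cortest.bartlett(cor(M), n = nrow(M))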

The other diagnostic is the Kaiser-Meyer-Olkin (KMO) index, which indicates the proportion of the data that can be explained by elements present in the dataset. The higher this score, the larger the proportion of the data that is explainable, for instance by PCA. The KMO index (also called MSA, for Measure of Sampling Adequacy) ranges from 0 (nothing is explainable) to 1 (everything is explainable). It can be returned for each item separately or for the overall dataset. We will examine the overall value, which is the first component of the object returned by the KMO() function from the psych package. The second component is the list of values for the individual items (not examined here). The function simply takes a matrix, a data frame, or a correlation matrix as an argument. Let's run it on our data:

KMO(motiv)[1]

This returns a value of 0.9715381, meaning that most of our data is explainable by the analysis.
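
If you also want to know which individual items are least adequate, the per-item values can be sorted (a sketch, assuming the MSAi element returned by KMO()):

itemMSA = KMO(motiv)$MSAi   # per-item measures of sampling adequacy (element name assumed)
head(sort(itemMSA), 5)      # the five items with the lowest values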

Summary

In this article, we have discussed how the PCA algorithm works, how to select the appropriate number of components and how to use PCA scores for further analysis.
