





















































In this article by Ashish Kumar and Avinash Paul, the authors of the book Mastering Text Mining with R, we will look at dimensionality reduction. Data volume and high dimensionality pose an astounding challenge in text mining tasks. Inherent noise and the computational cost of processing huge datasets make the problem even more arduous. The science of dimensionality reduction lies in the art of losing only a commensurately small amount of information while still reducing the high-dimensional space to a manageable size.
Before classification and clustering techniques can be applied to text data for various natural language processing tasks, we need to reduce the dimensions and the noise in the data so that each document can be represented using fewer features, significantly reducing the noise that can hinder performance.
Topic modeling and document clustering are common text mining activities, but text data can be very high-dimensional, which causes a phenomenon called the curse of dimensionality. Some literature also refers to it as concentration of measure:
Distance concentration is a phenomenon associated with high-dimensional spaces wherein pairwise distances or dissimilarities between points appear indistinguishable. All vectors in high dimensions appear to be nearly orthogonal to each other, and the distances from each data point to its nearest and farthest neighbors become almost equal. This jeopardizes the utility of methods that rely on distance-based measures.
Let's consider that the number of samples is n and the number of dimensions is d. If d is very large, the number of samples may prove insufficient to accurately estimate the parameters. For a dataset with d dimensions, the covariance matrix alone has on the order of d^2 parameters. In an ideal scenario, n should be much larger than d^2 to avoid overfitting.
In general, there is an optimal number of dimensions to use for a given fixed number of samples. It may seem like a good idea to engineer more features when we cannot solve a problem with the features we have, but computational cost and model complexity grow with the number of dimensions. For instance, if n samples are dense enough in a one-dimensional feature space, roughly n^k samples would be required to achieve the same density in a k-dimensional feature space.
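To make distance concentration concrete, here is a small, illustrative base R sketch (the helper name relative_contrast is ours, introduced only for this example). It draws random points in d dimensions and reports how much farther the farthest neighbor of a point is than its nearest neighbor; the ratio collapses toward zero as d grows:
set.seed(1)
relative_contrast <- function(n, d) {
  x <- matrix(rnorm(n * d), nrow = n)
  dists <- as.matrix(dist(x))[1, -1]       # distances from the first point to all others
  (max(dists) - min(dists)) / min(dists)   # relative contrast between farthest and nearest
}
sapply(c(2, 10, 100, 1000), function(d) relative_contrast(n = 100, d = d))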
The complex and noisy characteristics of high-dimensional textual data can be handled by dimensionality reduction techniques. These techniques reduce the dimension of the textual data while still preserving its underlying statistics. Although the dimensions are reduced, it is important to preserve the inter-document relationships. The idea is to use the minimum number of dimensions that still preserve the intrinsic dimensionality of the data.
A textual collection is mostly represented in the form of a term-document matrix, which records the importance of each term in each document. The dimensionality of such a collection grows with the number of unique terms. The simplest possible dimensionality reduction method is to place limits on the distribution of terms across the collection: a term that occurs with very high frequency carries little information, while terms that barely appear can safely be ignored and treated as noise.
Words that occur with high frequency but have no particular meaning are referred to as stop words; some examples are is, was, then, and the. Words that occur just once or twice are more likely to be spelling errors or rare, complicated words. Neither these nor stop words should be used to model a document in the Term Document Matrix (TDM).
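As a minimal, illustrative sketch (assuming the tm package is installed), the following builds a term-document matrix for three toy documents, drops English stop words, and removes very sparse terms:
library(tm)
docs <- c("text mining is fun", "mining text data", "the data is noisy")
corpus <- VCorpus(VectorSource(docs))
# build the TDM, dropping standard English stop words
tdm <- TermDocumentMatrix(corpus, control = list(stopwords = TRUE))
inspect(tdm)
# drop terms that are absent from most documents (rare terms, likely noise)
tdm_small <- removeSparseTerms(tdm, sparse = 0.7)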
We will discuss a few dimensionality reduction techniques in brief and dive into their implementation using R.
Principal component analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space onto a smaller subspace to decrease computational cost. PCA computes new features, called principal components, which are uncorrelated linear combinations of the original features projected in the directions of highest variability. The key step is to map the set of features into a matrix, M, and compute its eigenvalues and eigenvectors. Eigenvectors provide simpler solutions to problems that can be modeled using linear transformations along axes by stretching, compressing, or flipping; eigenvalues give the magnitude of the stretching along each eigenvector. Eigenvectors with larger eigenvalues are selected for the new feature space because they capture more information about the data distribution than eigenvectors with smaller eigenvalues. The first principal component has the greatest possible variance, that is, it corresponds to the largest eigenvalue; the second principal component has the maximum variance among all directions uncorrelated with the first. In general, the nth PC is the linear combination with maximum variance that is uncorrelated with all previous PCs.
PCA comprises the following steps:
1. Standardize (center and, optionally, scale) the data.
2. Compute the covariance matrix of the features.
3. Calculate the eigenvalues and eigenvectors of the covariance matrix.
4. Sort the eigenvectors by decreasing eigenvalue and select the components that capture most of the variance.
5. Project the original data onto the selected eigenvectors to obtain the reduced feature space.
Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space.
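To illustrate these steps, here is a minimal sketch on the built-in USArrests dataset, computing principal components directly from the eigendecomposition of the covariance matrix of the scaled data:
m <- scale(USArrests)              # centre and scale the variables
eig <- eigen(cov(m))               # eigenvalues and eigenvectors of the covariance matrix
eig$values                         # variance captured by each component
pcs <- m %*% eig$vectors[, 1:2]    # project the data onto the top two components
head(pcs)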
PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. This methodology also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis.
R also has two inbuilt functions for accomplishing PCA: prcomp() and princomp(). Both functions expect the dataset to be organized with variables in columns and observations in rows, in a structure like a data frame. They return the new data in the form of a data frame, with the principal components given in columns.
prcomp() and princomp() are similar functions for performing PCA, but they have slightly different implementations. Internally, the princomp() function performs PCA using an eigendecomposition of the covariance matrix. The prcomp() function uses a related technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function.
princomp() fails when the number of variables is larger than the number of observations.
Each function returns a list whose class is prcomp or princomp, respectively.
The information returned and the terminology are summarized in the following table:

| prcomp() | princomp() | Explanation |
| --- | --- | --- |
| sdev | sdev | Standard deviations of the principal components |
| rotation | loadings | The principal components (the matrix of variable loadings) |
| center | center | The values subtracted from each column to center the data |
| scale | scale | The scale factors used |
| x | scores | The rotated data (the data projected onto the principal components) |
|  | n.obs | The number of observations of each variable |
|  | call | The call with which the object was created |
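As a quick way to see these element names, we can inspect the objects returned for the built-in USArrests dataset:
names(prcomp(USArrests, scale. = TRUE))   # sdev, rotation, center, scale, x
names(princomp(USArrests, cor = TRUE))    # sdev, loadings, center, scale, n.obs, scores, call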
Here's a list of the functions available in different R packages for performing PCA:
FactoMineR is an R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions provided in the package deal not only with quantitative data but also with categorical data. Apart from PCA, correspondence analysis and multiple correspondence analysis can also be performed using this package:
library(FactoMineR)
# 1,000 observations on 10 standard normal variables
data <- replicate(10, rnorm(1000))
result.pca <- PCA(data[, 1:9], scale.unit = TRUE, graph = TRUE)
print(result.pca)
Results for the principal component analysis (PCA).
The analysis was performed on 1,000 individuals, described by nine variables.
The results are available in the following objects:
| Name | Description |
| --- | --- |
| $eig | Eigenvalues |
| $var | Results for the variables |
| $var$coord | Coordinates for the variables |
| $var$cor | Correlations between variables and dimensions |
| $var$cos2 | cos2 for the variables |
| $var$contrib | Contributions of the variables |
| $ind | Results for the individuals |
| $ind$coord | Coordinates for the individuals |
| $ind$cos2 | cos2 for the individuals |
| $ind$contrib | Contributions of the individuals |
| $call | Summary statistics |
| $call$centre | Mean of the variables |
| $call$ecart.type | Standard error of the variables |
| $call$row.w | Weights for the individuals |
| $call$col.w | Weights for the variables |
The eigenvalue, the percentage of variance, and the cumulative percentage of variance for each component are as follows:
comp 1 1.1573559 12.859510 12.85951
comp 2 1.0991481 12.212757 25.07227
comp 3 1.0553160 11.725734 36.79800
comp 4 1.0076069 11.195632 47.99363
comp 5 0.9841510 10.935011 58.92864
comp 6 0.9782554 10.869505 69.79815
comp 7 0.9466867 10.518741 80.31689
comp 8 0.9172075 10.191194 90.50808
comp 9 0.8542724 9.491916 100.00000
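The eigenvalue table shown here, along with the other results listed in the preceding object table, can be accessed directly from the returned object:
result.pca$eig            # eigenvalues and percentages of variance
result.pca$var$contrib    # contributions of the variables to each dimension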
Amap is another R package that provides tools for clustering and PCA. It is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which performs PCA on a data frame.
This function is akin to princomp() and prcomp(), except that it has a slightly different graphical representation.
For more intricate details, refer to the package documentation on CRAN:
https://cran.r-project.org/web/packages/amap/amap.pdf
library(amap)
acp(data, center=TRUE, reduce=TRUE)
Additionally, weight vectors can also be provided as arguments. We can perform a robust PCA by using the acpgen() function in the amap package:
acpgen(data, h1, h2, center=TRUE, reduce=TRUE, kernel="gaussien")
K(u, kernel="gaussien")
W(x,h,D=NULL,kernel="gaussien")
acprob(x,h,center=TRUE,reduce=TRUE,kernel="gaussien")
We look to construct the components and then choose, from among them, the minimum number of components that explains the variance in the data with high confidence.
R has a prcomp() function in the base distribution to estimate principal components. Let's learn how to use this function to estimate the principal components and the proportion of variance each one explains:
pca_base<-prcomp(data)
print(pca_base)
The pca_base object contains the standard deviations and the rotations of the vectors. The rotations are also known as the principal components of the data. Let's find out the proportion of variance each component explains:
pr_variance<- (pca_base$sdev^2/sum(pca_base$sdev^2))*100
pr_variance
[1] 11.678126 11.301480 10.846161 10.482861 10.176036 9.605907 9.498072
[8] 9.218186 8.762572 8.430598
pr_variance signifies the proportion of variance explained by each component in descending order of magnitude.
Let's calculate the cumulative proportion of variance for the components:
cumsum(pr_variance)
[1] 11.67813 22.97961 33.82577 44.30863 54.48467 64.09057 73.58864
[8] 82.80683 91.56940 100.00000
Components 1 to 8 together explain about 82% of the variance in the data.
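The same proportions are available directly from base R, which is a convenient cross-check:
summary(pca_base)                     # standard deviations, proportion and cumulative proportion of variance
screeplot(pca_base, type = "lines")   # scree plot of the variance per component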
Singular value decomposition (SVD) is a dimensionality reduction technique that gained a lot of popularity after the famous Netflix movie recommendation challenge. Since then, it has found use in many applications in statistics, mathematics, and signal processing.
It is primarily a technique to factorize any matrix, real or complex. A rectangular matrix can be factorized into two orthonormal matrices and a diagonal matrix of non-negative real values: M = U D V', where the diagonal entries of D are the singular values. An m*n matrix can be viewed as m points in n-dimensional space; SVD attempts to find the best k-dimensional subspace that fits the data.
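A minimal sketch with base R's svd() shows the factorization and a low-rank approximation (the matrix A here is a small random example, not data from this article):
set.seed(42)
A <- matrix(rnorm(20), nrow = 5)                 # a 5 x 4 rectangular matrix
s <- svd(A)                                      # A = U D V'
all.equal(A, s$u %*% diag(s$d) %*% t(s$v))       # TRUE, up to numerical precision
k <- 2                                           # keep only the 2 largest singular values
A_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])   # best rank-2 approximation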
SVD in R is used to compute approximations of the singular values and singular vectors of large-scale data matrices. These approximations are made using memory-efficient algorithms, and IRLBA (the implicitly restarted Lanczos bidiagonalization algorithm) is one of them. We shall use the irlba package here in order to implement SVD.
The following code will show the implementation of SVD using R:
# List of packages for the session
packages = c("foreach", "doParallel", "irlba")
# Install CRAN packages (if not already installed)
inst <- packages %in% installed.packages()
if(length(packages[!inst]) > 0) install.packages(packages[!inst])
# Load packages into session
lapply(packages, require, character.only=TRUE)
# register a parallel backend for the foreach loops, using all available cores
registerDoParallel(cores=detectCores(all.tests=TRUE))
std_svd <- function(x, k, p = 25, iter = 1) {
  m1 <- as.matrix(x)
  r <- nrow(m1)
  c <- ncol(m1)
  # oversampling: keep the sketch size k + p within the matrix dimensions
  p <- min(min(r, c) - k, p)
  z <- k + p
  # random projection of the columns onto a z-dimensional subspace
  m2 <- matrix(rnorm(z * c), nrow = c, ncol = z)
  y <- m1 %*% m2
  q <- qr.Q(qr(y))
  b <- t(q) %*% m1
  # power iterations to refine the estimate of the row space
  b1 <- foreach(i = 1:iter) %dopar% {
    y1 <- m1 %*% t(b)
    q1 <- qr.Q(qr(y1))
    t(q1) %*% m1
  }
  b1 <- b1[[iter]]
  # eigendecomposition of the small z x z matrix yields the singular values
  b2 <- b1 %*% t(b1)
  eigens <- eigen(b2, symmetric = TRUE)
  result <- list()
  result$svalues <- sqrt(eigens$values)[1:k]
  # approximate left and right singular vectors of the original matrix
  result$u <- (q %*% eigens$vectors)[, 1:k]
  result$v <- (t(b) %*% eigens$vectors %*% diag(1 / sqrt(eigens$values)))[, 1:k]
  return(result)
}
svd <- std_svd(x = data, k = 5)
# singular values
svd$svalues
[1] 35.37645 33.76244 32.93265 32.72369 31.46702
These are the approximate singular values returned by the decomposition. They can be compared with the exact decomposition to understand how well the approximation performed.
Latent factors
# svd$u holds the left singular vectors, svd$v the right singular vectors
dim(svd$u)
[1] 1000 5
dim(svd$v)
[1] 10 5
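As a sanity check, the approximate singular values can be compared against base R's exact svd() on the same matrix (the data object created earlier); the leading values should closely agree:
# the base:: prefix avoids the clash with the svd object created above
base::svd(as.matrix(data))$d[1:5]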
A more robust version of the previous function can be achieved by altering the power iterations:
foreach(i = 1:iter) %dopar% {
  y1 <- m1 %*% t(b)
  # orthonormalize y1 via the Cholesky factor of its cross-product
  y2 <- t(y1) %*% y1
  r2 <- chol(y2, pivot = TRUE)
  q1 <- y1 %*% solve(r2)
  b1 <- t(q1) %*% m1
}
b2 <- b1 %*% t(b1)
Some other functions available in R packages are as follows:
| Function | Package |
| --- | --- |
| svd() | svd |
| irlba() | irlba |
| svdImpute() | bcv |
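For completeness, here is a short sketch of how the irlba package mentioned earlier computes the same truncated decomposition directly, assuming the data matrix created before:
library(irlba)
approx_svd <- irlba(as.matrix(data), nv = 5)   # top 5 singular values and vectors
approx_svd$d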
ISOMAP is a nonlinear dimensionality reduction method, representative of isometric mapping methods and one of the approaches to manifold learning. ISOMAP finds the map that preserves the global, nonlinear geometry of the data by preserving the geodesic inter-point distances on the manifold. Like multidimensional scaling, ISOMAP creates a visual presentation of the distances between a number of objects. A geodesic is the shortest curve along the manifold connecting two points, induced by a neighborhood graph. Classical multidimensional scaling uses the Euclidean distance measure; since the data lies on a nonlinear manifold, ISOMAP uses the geodesic distance instead. ISOMAP can thus be viewed as an extension of metric multidimensional scaling.
At a very high level, ISOMAP can be described in four steps:
1. Determine the neighbors of each point, for example the k nearest neighbors or all points within a fixed radius.
2. Construct a neighborhood graph in which each point is connected to its neighbors, with edges weighted by the Euclidean distances between them.
3. Compute the shortest-path (geodesic) distances between every pair of points in the graph, for example with Dijkstra's or the Floyd-Warshall algorithm.
4. Apply classical multidimensional scaling to the matrix of geodesic distances to obtain the low-dimensional embedding.
The RDRToolbox package, available through Bioconductor, provides an Isomap() implementation, which we demonstrate on the Swiss roll dataset:
source("http://bioconductor.org/biocLite.R")
biocLite("RDRToolbox")
library('RDRToolbox')
swiss_Data=SwissRoll(N = 1000, Plot=TRUE)
x=SwissRoll()
open3d()
plot3d(x, col=rainbow(1050)[-c(1:50)],box=FALSE,type="s",size=1)
simData_Iso = Isomap(data=swiss_Data, dims=1:10, k=10,plotResiduals=TRUE)
library(vegan)
data(BCI)
# dissimilarity matrix of the Barro Colorado Island tree counts
distance <- vegdist(BCI)
tree <- spantree(distance)
pl1 <- ordiplot(cmdscale(distance), main = "cmdscale")
lines(tree, pl1, col = "red")
z <- isomap(distance, k = 3)
# 3D plot of the embedding; in newer versions rgl.isomap() is provided by the vegan3d package
rgl.isomap(z, size = 4, color = "red")
pl2 <- plot(isomap(distance, epsilon = 0.5), main = "isomap epsilon=0.5")
pl3 <- plot(isomap(distance, k = 5), main = "isomap k=5")
pl4 <- plot(z, main = "isomap k=3")
The idea of this article was to familiarize you with some of the generic dimensionality reduction methods and their implementation in R. We discussed a few packages that provide functions to perform these tasks, as well as a few custom functions that can be used for them. Kudos, you have completed the basics of text mining with R. You should now feel confident about various data mining methods, text mining algorithms related to natural language processing of texts, and, after reading this article, dimensionality reduction.
If you feel a little low on confidence, do not be upset. Turn a few pages back and try implementing those tiny code snippets on your own dataset and figure out how they help you understand your data.
Remember this - to mine something, you have to get into it by yourself. This holds true for text as well.