





















































In this article by Ashish Kumar and Avinash Paul, the authors of the book Mastering Text Mining with R, we will look at dimensionality reduction. Data volume and high dimensionality pose an astounding challenge in text mining tasks. Inherent noise and the computational cost of processing huge datasets make the problem even more arduous. The science of dimensionality reduction lies in the art of losing only a commensurately small amount of information while still reducing the high-dimensional space to a manageable size.
Before classification and clustering techniques can be applied to text data for various natural language processing tasks, we need to reduce the dimensions and the noise in the data so that each document can be represented using fewer features, significantly reducing the noise that can hinder performance.
Topic modeling and document clustering are common text mining activities, but text data can be very high-dimensional, which causes a phenomenon called the curse of dimensionality. Some literature also refers to it as concentration of measure:
Distance concentration is a phenomenon associated with high-dimensional spaces wherein pairwise distances or dissimilarities between points appear indistinguishable. All vectors in high dimensions appear to be nearly orthogonal to each other, and the distances from each data point to its nearest and farthest neighbors become almost equal. This jeopardizes the utility of methods that rely on distance-based measures.
Let's consider that the number of samples is n and the number of dimensions is d. If d is very large, the number of samples may prove insufficient to accurately estimate the parameters. For a dataset with d dimensions, the covariance matrix alone has on the order of d^2 parameters. In an ideal scenario, n should be much larger than d^2 to avoid overfitting.
In general, there is an optimal number of dimensions to use for a given fixed number of samples. It may seem like a good idea to engineer more features when we cannot solve a problem with the features we have, but computational cost and model complexity grow with the number of dimensions. For instance, if n samples are dense enough in a one-dimensional feature space, roughly n^k samples would be required to achieve the same density in a k-dimensional feature space.
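To make distance concentration concrete, here is a small, illustrative base R sketch (the helper name relative_contrast is ours, introduced only for this example). It draws random points in d dimensions and reports how much farther the farthest neighbor of a point is than its nearest neighbor; the ratio collapses toward zero as d grows:
set.seed(1)
relative_contrast <- function(n, d) {
  x <- matrix(rnorm(n * d), nrow = n)
  dists <- as.matrix(dist(x))[1, -1]       # distances from the first point to all others
  (max(dists) - min(dists)) / min(dists)   # relative contrast between farthest and nearest
}
sapply(c(2, 10, 100, 1000), function(d) relative_contrast(n = 100, d = d))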
The complex and noisy characteristics of high-dimensional textual data can be handled by dimensionality reduction techniques. These techniques reduce the dimension of the textual data while still preserving its underlying statistics. Although the dimensions are reduced, it is important to preserve the inter-document relationships. The idea is to use the minimum number of dimensions that still preserve the intrinsic dimensionality of the data.
A textual collection is mostly represented in the form of a term-document matrix, which records the importance of each term in each document. The dimensionality of such a collection grows with the number of unique terms. The simplest possible dimensionality reduction method is to place limits on the distribution of terms across the collection: a term that occurs with very high frequency carries little information, while terms that barely appear can safely be ignored and treated as noise.
Words that occur with high frequency but have no particular meaning are referred to as stop words; some examples are is, was, then, and the. Words that occur just once or twice are more likely to be spelling errors or rare, complicated words. Neither these nor stop words should be used to model a document in the Term Document Matrix (TDM).
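As a minimal, illustrative sketch (assuming the tm package is installed), the following builds a term-document matrix for three toy documents, drops English stop words, and removes very sparse terms:
library(tm)
docs <- c("text mining is fun", "mining text data", "the data is noisy")
corpus <- VCorpus(VectorSource(docs))
# build the TDM, dropping standard English stop words
tdm <- TermDocumentMatrix(corpus, control = list(stopwords = TRUE))
inspect(tdm)
# drop terms that are absent from most documents (rare terms, likely noise)
tdm_small <- removeSparseTerms(tdm, sparse = 0.7)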
We will discuss a few dimensionality reduction techniques in brief and dive into their implementation using R.
Principal component analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space onto a smaller subspace to decrease computational cost. PCA computes new features, called principal components, which are uncorrelated linear combinations of the original features projected in the directions of highest variability. The key step is to map the set of features into a matrix, M, and compute its eigenvalues and eigenvectors. Eigenvectors provide simpler solutions to problems that can be modeled using linear transformations along axes by stretching, compressing, or flipping; eigenvalues give the magnitude of the stretching along each eigenvector. Eigenvectors with larger eigenvalues are selected for the new feature space because they capture more information about the data distribution than eigenvectors with smaller eigenvalues. The first principal component has the greatest possible variance, that is, it corresponds to the largest eigenvalue; the second principal component has the maximum variance among all directions uncorrelated with the first. In general, the nth PC is the linear combination with maximum variance that is uncorrelated with all previous PCs.
PCA comprises the following steps:
1. Standardize (center and, optionally, scale) the data.
2. Compute the covariance matrix of the features.
3. Calculate the eigenvalues and eigenvectors of the covariance matrix.
4. Sort the eigenvectors by decreasing eigenvalue and select the components that capture most of the variance.
5. Project the original data onto the selected eigenvectors to obtain the reduced feature space.
Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space.
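To illustrate these steps, here is a minimal sketch on the built-in USArrests dataset, computing principal components directly from the eigendecomposition of the covariance matrix of the scaled data:
m <- scale(USArrests)              # centre and scale the variables
eig <- eigen(cov(m))               # eigenvalues and eigenvectors of the covariance matrix
eig$values                         # variance captured by each component
pcs <- m %*% eig$vectors[, 1:2]    # project the data onto the top two components
head(pcs)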
PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. This methodology also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis.
R also has two inbuilt functions for accomplishing PCA: prcomp() and princomp(). Both functions expect the dataset to be organized with variables in columns and observations in rows, in a structure like a data frame. They return the new data in the form of a data frame, with the principal components given in columns.
prcomp() and princomp() are similar functions for performing PCA, but they have slightly different implementations. Internally, the princomp() function performs PCA using an eigendecomposition of the covariance matrix. The prcomp() function uses a related technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function.
princomp() fails when the number of variables is larger than the number of observations.
Each function returns a list whose class is prcomp or princomp, respectively.
The information returned and the terminology are summarized in the following table:

| prcomp() | princomp() | Explanation |
| --- | --- | --- |
| sdev | sdev | Standard deviations of the principal components |
| rotation | loadings | The principal components (the matrix of variable loadings) |
| center | center | The values subtracted from each column to center the data |
| scale | scale | The scale factors used |
| x | scores | The rotated data (the data projected onto the principal components) |
|  | n.obs | The number of observations of each variable |
|  | call | The call with which the object was created |
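As a quick way to see these element names, we can inspect the objects returned for the built-in USArrests dataset:
names(prcomp(USArrests, scale. = TRUE))   # sdev, rotation, center, scale, x
names(princomp(USArrests, cor = TRUE))    # sdev, loadings, center, scale, n.obs, scores, call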
Here's a list of the functions available in different R packages for performing PCA:
FactoMineR is an R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions provided in the package deal not only with quantitative data but also with categorical data. Apart from PCA, correspondence analysis and multiple correspondence analysis can also be performed using this package:
library(FactoMineR)
# 1,000 observations on 10 standard normal variables
data <- replicate(10, rnorm(1000))
result.pca <- PCA(data[, 1:9], scale.unit = TRUE, graph = TRUE)
print(result.pca)
Results for the principal component analysis (PCA).
The analysis was performed on 1,000 individuals, described by nine variables.
The results are available in the following objects:
| Name | Description |
| --- | --- |
| $eig | Eigenvalues |
| $var | Results for the variables |
| $var$coord | Coordinates for the variables |
| $var$cor | Correlations between variables and dimensions |
| $var$cos2 | cos2 for the variables |
| $var$contrib | Contributions of the variables |
| $ind | Results for the individuals |
| $ind$coord | Coordinates for the individuals |
| $ind$cos2 | cos2 for the individuals |
| $ind$contrib | Contributions of the individuals |
| $call | Summary statistics |
| $call$centre | Mean of the variables |
| $call$ecart.type | Standard error of the variables |
| $call$row.w | Weights for the individuals |
| $call$col.w | Weights for the variables |
The eigenvalue, the percentage of variance, and the cumulative percentage of variance for each component are as follows:
comp 1 1.1573559 12.859510 12.85951
comp 2 1.0991481 12.212757 25.07227
comp 3 1.0553160 11.725734 36.79800
comp 4 1.0076069 11.195632 47.99363
comp 5 0.9841510 10.935011 58.92864
comp 6 0.9782554 10.869505 69.79815
comp 7 0.9466867 10.518741 80.31689
comp 8 0.9172075 10.191194 90.50808
comp 9 0.8542724 9.491916 100.00000
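The eigenvalue table shown here, along with the other results listed in the preceding object table, can be accessed directly from the returned object:
result.pca$eig            # eigenvalues and percentages of variance
result.pca$var$contrib    # contributions of the variables to each dimension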
Amap is another R package that provides tools for clustering and PCA. It is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which performs PCA on a data frame.
This function is akin to princomp() and prcomp(), except that it has a slightly different graphical representation.
For more intricate details, refer to the package documentation on CRAN:
https://cran.r-project.org/web/packages/amap/amap.pdf
library(amap)
acp(data, center=TRUE, reduce=TRUE)
Additionally, weight vectors can also be provided as arguments. We can perform a robust PCA by using the acpgen() function in the amap package:
acpgen(data, h1, h2, center=TRUE, reduce=TRUE, kernel="gaussien")
K(u, kernel="gaussien")
W(x,h,D=NULL,kernel="gaussien")
acprob(x,h,center=TRUE,reduce=TRUE,kernel="gaussien")
We look to construct the components and then choose, from among them, the minimum number of components that explains the variance in the data with high confidence.
R has a prcomp() function in the base distribution to estimate principal components. Let's learn how to use this function to estimate the principal components and the proportion of variance each one explains:
pca_base<-prcomp(data)
print(pca_base)
The pca_base object contains the standard deviations and the rotations of the vectors. The rotations are also known as the principal components of the data. Let's find out the proportion of variance each component explains:
pr_variance<- (pca_base$sdev^2/sum(pca_base$sdev^2))*100
pr_variance
[1] 11.678126 11.301480 10.846161 10.482861 10.176036 9.605907 9.498072
[8] 9.218186 8.762572 8.430598
pr_variance signifies the proportion of variance explained by each component in descending order of magnitude.
Let's calculate the cumulative proportion of variance for the components:
cumsum(pr_variance)
[1] 11.67813 22.97961 33.82577 44.30863 54.48467 64.09057 73.58864
[8] 82.80683 91.56940 100.00000
Components 1 to 8 together explain about 82% of the variance in the data.
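The same proportions are available directly from base R, which is a convenient cross-check:
summary(pca_base)                     # standard deviations, proportion and cumulative proportion of variance
screeplot(pca_base, type = "lines")   # scree plot of the variance per component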
Singular value decomposition (SVD) is a dimensionality reduction technique that gained a lot of popularity after the famous Netflix movie recommendation challenge. Since then, it has found use in many applications in statistics, mathematics, and signal processing.
It is primarily a technique to factorize any matrix, real or complex. A rectangular matrix can be factorized into two orthonormal matrices and a diagonal matrix of non-negative real values: M = U D V', where the diagonal entries of D are the singular values. An m*n matrix can be viewed as m points in n-dimensional space; SVD attempts to find the best k-dimensional subspace that fits the data.
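A minimal sketch with base R's svd() shows the factorization and a low-rank approximation (the matrix A here is a small random example, not data from this article):
set.seed(42)
A <- matrix(rnorm(20), nrow = 5)                 # a 5 x 4 rectangular matrix
s <- svd(A)                                      # A = U D V'
all.equal(A, s$u %*% diag(s$d) %*% t(s$v))       # TRUE, up to numerical precision
k <- 2                                           # keep only the 2 largest singular values
A_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])   # best rank-2 approximation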
SVD in R is used to compute approximations of the singular values and singular vectors of large-scale data matrices. These approximations are made using memory-efficient algorithms, and IRLBA (the implicitly restarted Lanczos bidiagonalization algorithm) is one of them. We shall use the irlba package here in order to implement SVD.
The following code will show the implementation of SVD using R:
# List of packages for the session
packages = c("foreach", "doParallel", "irlba")
# Install CRAN packages (if not already installed)
inst <- packages %in% installed.packages()
if(length(packages[!inst]) > 0) install.packages(packages[!inst])
# Load packages into session
lapply(packages, require, character.only=TRUE)
# register a parallel backend for the foreach loops, using all available cores
registerDoParallel(cores=detectCores(all.tests=TRUE))
std_svd <- function(x, k, p = 25, iter = 1) {
  m1 <- as.matrix(x)
  r <- nrow(m1)
  c <- ncol(m1)
  # oversampling: keep the sketch size k + p within the matrix dimensions
  p <- min(min(r, c) - k, p)
  z <- k + p
  # random projection of the columns onto a z-dimensional subspace
  m2 <- matrix(rnorm(z * c), nrow = c, ncol = z)
  y <- m1 %*% m2
  q <- qr.Q(qr(y))
  b <- t(q) %*% m1
  # power iterations to refine the estimate of the row space
  b1 <- foreach(i = 1:iter) %dopar% {
    y1 <- m1 %*% t(b)
    q1 <- qr.Q(qr(y1))
    t(q1) %*% m1
  }
  b1 <- b1[[iter]]
  # eigendecomposition of the small z x z matrix yields the singular values
  b2 <- b1 %*% t(b1)
  eigens <- eigen(b2, symmetric = TRUE)
  result <- list()
  result$svalues <- sqrt(eigens$values)[1:k]
  # approximate left and right singular vectors of the original matrix
  result$u <- (q %*% eigens$vectors)[, 1:k]
  result$v <- (t(b) %*% eigens$vectors %*% diag(1 / sqrt(eigens$values)))[, 1:k]
  return(result)
}
svd <- std_svd(x = data, k = 5)
# singular values
svd$svalues
[1] 35.37645 33.76244 32.93265 32.72369 31.46702
These are the approximate singular values returned by the decomposition. They can be compared with the exact decomposition to understand how well the approximation performed.
Latent factors
# svd$u holds the left singular vectors, svd$v the right singular vectors
dim(svd$u)
[1] 1000 5
dim(svd$v)
[1] 10 5
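As a sanity check, the approximate singular values can be compared against base R's exact svd() on the same matrix (the data object created earlier); the leading values should closely agree:
# the base:: prefix avoids the clash with the svd object created above
base::svd(as.matrix(data))$d[1:5]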
A more robust version of the previous function can be achieved by altering the power iterations:
foreach(i = 1:iter) %dopar% {
  y1 <- m1 %*% t(b)
  # orthonormalize y1 via the Cholesky factor of its cross-product
  y2 <- t(y1) %*% y1
  r2 <- chol(y2, pivot = TRUE)
  q1 <- y1 %*% solve(r2)
  b1 <- t(q1) %*% m1
}
b2 <- b1 %*% t(b1)
Some other functions available in R packages are as follows:
| Function | Package |
| --- | --- |
| svd() | svd |
| irlba() | irlba |
| svdImpute() | bcv |
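For completeness, here is a short sketch of how the irlba package mentioned earlier computes the same truncated decomposition directly, assuming the data matrix created before:
library(irlba)
approx_svd <- irlba(as.matrix(data), nv = 5)   # top 5 singular values and vectors
approx_svd$d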
ISOMAP is a nonlinear dimensionality reduction method, representative of isometric mapping methods and one of the approaches to manifold learning. ISOMAP finds the map that preserves the global, nonlinear geometry of the data by preserving the geodesic inter-point distances on the manifold. Like multidimensional scaling, ISOMAP creates a visual presentation of the distances between a number of objects. A geodesic is the shortest curve along the manifold connecting two points, induced by a neighborhood graph. Classical multidimensional scaling uses the Euclidean distance measure; since the data lies on a nonlinear manifold, ISOMAP uses the geodesic distance instead. ISOMAP can thus be viewed as an extension of metric multidimensional scaling.
At a very high level, ISOMAP can be described in four steps:
1. Determine the neighbors of each point, for example the k nearest neighbors or all points within a fixed radius.
2. Construct a neighborhood graph in which each point is connected to its neighbors, with edges weighted by the Euclidean distances between them.
3. Compute the shortest-path (geodesic) distances between every pair of points in the graph, for example with Dijkstra's or the Floyd-Warshall algorithm.
4. Apply classical multidimensional scaling to the matrix of geodesic distances to obtain the low-dimensional embedding.
The RDRToolbox package, available through Bioconductor, provides an Isomap() implementation, which we demonstrate on the Swiss roll dataset:
source("http://bioconductor.org/biocLite.R")
biocLite("RDRToolbox")
library('RDRToolbox')
swiss_Data=SwissRoll(N = 1000, Plot=TRUE)
x=SwissRoll()
open3d()
plot3d(x, col=rainbow(1050)[-c(1:50)],box=FALSE,type="s",size=1)
simData_Iso = Isomap(data=swiss_Data, dims=1:10, k=10,plotResiduals=TRUE)
library(vegan)
data(BCI)
# dissimilarity matrix of the Barro Colorado Island tree counts
distance <- vegdist(BCI)
tree <- spantree(distance)
pl1 <- ordiplot(cmdscale(distance), main = "cmdscale")
lines(tree, pl1, col = "red")
z <- isomap(distance, k = 3)
# 3D plot of the embedding; in newer versions rgl.isomap() is provided by the vegan3d package
rgl.isomap(z, size = 4, color = "red")
pl2 <- plot(isomap(distance, epsilon = 0.5), main = "isomap epsilon=0.5")
pl3 <- plot(isomap(distance, k = 5), main = "isomap k=5")
pl4 <- plot(z, main = "isomap k=3")
The idea of this article was to familiarize you with some of the generic dimensionality reduction methods and their implementation in R. We discussed a few packages that provide functions to perform these tasks, as well as a few custom functions that can be used for them. Kudos, you have completed the basics of text mining with R. You should now feel confident about various data mining methods, text mining algorithms related to natural language processing of texts, and, after reading this article, dimensionality reduction.
If you feel a little low on confidence, do not be upset. Turn a few pages back and try implementing those tiny code snippets on your own dataset and figure out how they help you understand your data.
Remember this - to mine something, you have to get into it by yourself. This holds true for text as well.