Imputing data
Missing values are often the first obstacle in data analysis and predictive modeling. Most statistical analysis methods fall back on list-wise deletion by default, and the earlier recipe replaced missing values with a simple mean or median. Neither approach is quite good enough: deletion can lead to information loss, and replacement with a single mean or median doesn't take into account the uncertainty in the missing values.
Hence, this recipe shows you how to handle missing values with multivariate imputation, which predicts each missing value from the other variables.
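To make the point about uncertainty concrete, here is a minimal base R sketch. It uses simulated numbers rather than the housing data (the vector x and its parameters are made up for illustration) and shows that filling every gap with the mean artificially shrinks the spread of a column:

```r
set.seed(42)                                   # fixed seed so the run is reproducible
x <- rnorm(200, mean = 18, sd = 2)             # simulated stand-in for a numeric column
x_missing <- x
x_missing[sample(length(x), 60)] <- NA         # blank out 60 of the 200 values at random

# Mean imputation: every gap receives the same value
observed_mean <- mean(x_missing, na.rm = TRUE)
x_mean_imputed <- ifelse(is.na(x_missing), observed_mean, x_missing)

sd_observed <- sd(x_missing, na.rm = TRUE)     # spread of the values we actually saw
sd_imputed <- sd(x_mean_imputed)               # spread after mean imputation

c(observed = sd_observed, mean_imputed = sd_imputed)
```

The imputed column always has the smaller standard deviation, because the filled-in values add no variation of their own. This is the understated uncertainty that multiple imputation is meant to address.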
Getting ready
Make sure that the housing-with-missing-value.csv file from the code files of this chapter is in your R working directory.
You should also install the mice package using the following commands:
> install.packages("mice")
> library(mice)
> housingData <- read.csv("housing-with-missing-value.csv", header = TRUE, stringsAsFactors = FALSE)
How to do it...
Follow these steps to impute data:
- Perform multivariate imputation:
> # imputing only two columns having missing values
> columns <- c("ptratio", "rad")
> imputed_Data <- mice(housingData[, names(housingData) %in% columns], m = 5, maxit = 50, method = 'pmm', seed = 500)
> summary(imputed_Data)
- Generate complete data:
> completeData <- complete(imputed_Data)
- Replace the two incomplete columns in housingData with their imputed values:
> housingData$ptratio <- completeData$ptratio
> housingData$rad <- completeData$rad
- Check for missing values:
> anyNA(housingData)
How it works...
As we already know from our earlier recipe, the housing.csv dataset contains two columns, ptratio and rad, with missing values.
The mice package in R uses a predictive approach and assumes that the missing data is Missing at Random (MAR). It creates multiple imputations via chained equations to account for the uncertainty in the missing values. Imputation takes just two steps: mice() builds the imputation model, and complete() generates the completed data (by default, the first of the m imputed datasets).
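The two-step flow can be tried without the housing file. The following sketch assumes the mice package is installed and uses nhanes, a small example data frame bundled with mice, in place of housingData; the variable names are illustrative only:

```r
library(mice)

# nhanes ships with mice: a small data frame with NAs in several columns
imp <- mice(nhanes, m = 5, maxit = 5, method = "pmm",
            seed = 500, printFlag = FALSE)

# Step two: extract completed data from the imputation object
first_completed <- complete(imp, 1)       # only the first of the m completed datasets
all_completed <- complete(imp, "long")    # all m datasets stacked, tagged by .imp and .id

anyNA(first_completed)                    # check that no NA remains
table(all_completed$.imp)                 # number of rows contributed by each imputation
```

Working with a single completed dataset, as the recipe does, is convenient, but the "long" form keeps all m versions available if you later want to compare them.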
The mice() function takes the following parameters:
- m: The number of imputed datasets that mice() creates internally. The default is five.
- maxit: The number of iterations used to impute the missing values. The default is five.
- method: The method used for imputation. When no method is specified, the default depends on the measurement level of the target column and is taken from the defaultMethod argument, where defaultMethod = c("pmm", "logreg", "polyreg", "polr"):
  - pmm: Predictive mean matching (numeric column).
  - logreg: Logistic regression (factor column, two levels).
  - polyreg: Polytomous logistic regression (unordered factor column, more than two levels).
  - polr: Proportional odds model (ordered factor column, more than two levels).
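To see which default method mice() would pick for each column, one option is a dry run with maxit = 0. The sketch below assumes the mice package is installed and uses its bundled nhanes2 data frame (which mixes numeric and factor columns) rather than the housing data; switching bmi to "norm" is only an illustration of overriding a default:

```r
library(mice)

# maxit = 0 sets up the imputation model without running any iterations
setup <- mice(nhanes2, maxit = 0, printFlag = FALSE)
setup$method        # one entry per column; "" marks a column with nothing to impute

# Override the default for a single column, keeping the others unchanged
meth <- setup$method
meth["bmi"] <- "norm"                     # Bayesian linear regression instead of pmm
imp <- mice(nhanes2, method = meth, m = 5, maxit = 5,
            seed = 500, printFlag = FALSE)
completed <- complete(imp, 1)
```

Passing a named method vector like this lets you keep the defaults for most columns and change only the ones where a different model seems more appropriate.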
We have used predictive mean matching (pmm) for this recipe to impute the missing values in the dataset.
The anyNA() function returns TRUE if the dataset contains any missing values (NA) and FALSE otherwise, so a result of FALSE in the final step confirms that every missing value has been filled in.
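Because anyNA() gives only a single yes-or-no answer, it can help to count the missing values per column as well. The following base R sketch uses a small made-up data frame (toy is hypothetical and merely borrows column names from the housing data):

```r
# Hypothetical toy data frame standing in for housingData
toy <- data.frame(ptratio = c(15.3, NA, 17.8, 18.7),
                  rad     = c(1, 2, NA, NA),
                  tax     = c(296, 242, 242, 222))

has_missing <- anyNA(toy)                 # TRUE: at least one NA somewhere in the data frame
na_per_column <- colSums(is.na(toy))      # ptratio = 1, rad = 2, tax = 0

has_missing
na_per_column
```

Running the same two checks before and after imputation shows which columns were incomplete and confirms that none remain so.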
There's more...
Previously, we used the impute() function from the Hmisc package to fill in missing values with a simple statistic (mean, median, or mode). However, Hmisc also has the aregImpute() function, which performs multiple imputation using additive regression, bootstrapping, and predictive mean matching:
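> library(Hmisc)
> housingData <- read.csv("housing-with-missing-value.csv", header = TRUE, stringsAsFactors = FALSE)  # reload so that ptratio and rad contain NAs again
> impute_arg <- aregImpute(~ ptratio + rad, data = housingData, n.impute = 5)
> impute_arg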
aregImpute() automatically identifies the variable type and treats it accordingly. The n.impute parameter sets the number of multiple imputations; five is a commonly recommended value.
The output of impute_arg shows the R² values obtained when the non-missing values of each variable are predicted from the other variables. The higher the value, the better the predictions.
Check imputed variable values using the following command:
> impute_arg$imputed$rad