You're reading from Practical Data Wrangling Expert techniques for transforming your raw data into a valuable source for analytics

Product type Paperback

Published in Nov 2017

Publisher Packt

ISBN-13 9781787286139

Length 204 pages

Edition 1st Edition

Languages

Python

Tools

RStudio

Concepts

Data Analysis

Author (1):

Visochek

Handling NA values

Sometimes, it is acceptable to have NA values in the dataset. However, for many types of analysis, NA values need to be either removed or replaced. In the case of road length, a better estimate of total road length could be generated if the NA values were replaced with best guesses. In the following subsections, I will walk through these three approaches to handling NA values:

Deletion
Insertion (replacement with a constant)
Imputation

Deleting missing values

The simplest way to handle NA values is to delete any entry that contains an NA value, or more than a certain number of NA values. When removing entries with NA values, there is a trade-off between the correctness of the data and the completeness of the data. Data entries that contain NA values may also contain several useful non-NA values, and removing too many data entries could reduce the dataset to a point where it is no longer useful.
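
To make the trade-off concrete, here is a minimal sketch that counts the NA values in each row and then deletes rows under two different thresholds. The toy dataframe and the names na.counts, toy.complete, and toy.lenient are made up for illustration and are not part of r_intro.R:

```r
# Hypothetical toy data: three regions, three years (not the roads dataset)
toy <- data.frame(
  X2009 = c(10, NA, NA),
  X2010 = c(12, 5, NA),
  X2011 = c(NA, 6, NA)
)

# is.na() marks each NA cell as TRUE, so rowSums() counts the NAs per row
na.counts <- rowSums(is.na(toy))

# Strict deletion: keep only rows with no NA values at all
toy.complete <- toy[na.counts == 0, ]

# Lenient deletion: keep rows with at most one NA value
toy.lenient <- toy[na.counts <= 1, ]

print(nrow(toy.complete))
print(nrow(toy.lenient))
```

With this toy data, the strict rule discards every row, while the lenient rule keeps the two rows that still carry useful values.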

For this dataset, it is not that important to have all of the years present; even one year is enough to give us a rough idea of how much road length is in the particular region at any point over the 12 years. A safe approach for this particular application would be to remove all of the rows where all of the values are NA.

A quick shortcut for finding the rows in which all values are NA is the rowSums() function. The rowSums() function finds the sum of each row, and takes a parameter, na.rm, that tells it to ignore NA values. The following finds the sum of the non-NA values in each row of the roads.num2 dataframe:

roads.num2.rowsums <- rowSums(roads.num2,na.rm=TRUE)
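
As a small illustration of what na.rm changes, the following sketch runs rowSums() with and without it. The toy dataframe and variable names are assumptions made for this example only:

```r
# Hypothetical toy data: the first row is partly NA, the second is all NA
toy <- data.frame(
  X2009 = c(10, NA),
  X2010 = c(12, NA),
  X2011 = c(NA_real_, NA_real_)
)

# Without na.rm, any NA in a row makes the whole row sum NA
sums.default <- rowSums(toy)

# With na.rm=TRUE, NA values are skipped; an all-NA row sums to 0
sums.ignore.na <- rowSums(toy, na.rm = TRUE)

print(sums.default)
print(sums.ignore.na)
```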

Because NA values are ignored, in the resulting vector of row sums, a 0 corresponds to either a row with all NA values or a region with no roads. In either case, a 0 value corresponds to a row that is not important and can be filtered out. The following creates an index that can be used to filter out all such rows:

roads.keep3 <- roads.num2.rowsums > 0

In the following continuation of r_intro.R, the roads.keep3 vector is used to filter out the rows that have either all NA values or 0 roads:

roads3 <- roads2[roads.keep3,]
roads.num3 <- roads.num2[roads.keep3,]
## recompute the row means for the rows that were kept
roads.means3 <- rowMeans(roads.num3,na.rm=TRUE)
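
The same filtering pattern can be sketched on a small made-up dataframe. Here I am assuming a layout with one label column plus numeric year columns; the names toy, toy.num, and toy.keep are hypothetical:

```r
# Hypothetical toy data: a label column plus two numeric year columns
toy <- data.frame(
  region = c("A", "B", "C"),
  X2010 = c(12, NA, 0),
  X2011 = c(NA, NA, 0)
)

# Numeric-only copy, so that rowSums() can be applied
toy.num <- toy[, c("X2010", "X2011")]

# TRUE for rows with some recorded road length, FALSE for all-NA or all-0 rows
toy.keep <- rowSums(toy.num, na.rm = TRUE) > 0

# The same logical index filters both the full and the numeric-only dataframe
toy.filtered <- toy[toy.keep, ]
toy.num.filtered <- toy.num[toy.keep, ]

print(toy.filtered)
```

Because the index is computed once and reused, the filtered dataframes stay aligned row for row.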

Next, I will do a quick demonstration of another approach to NA handling, replacing the values with a constant.

Replacing missing values with a constant

Replacing all NA values with a constant is rather simple. A dataframe can be indexed using a matrix of logical values of the same dimensions, which is what the is.na() function returns when it is given a dataframe. The following will create a new dataframe that is a copy of roads3 and replace all of the NA values with 0:

roads.replace.na <- roads3
roads.replace.na[is.na(roads3)] <- 0
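
To see the logical index on its own, here is a minimal sketch using a made-up dataframe; toy, na.mask, and toy.replace.na are names I am assuming for illustration:

```r
# Hypothetical toy data with two missing values
toy <- data.frame(
  X2010 = c(12, NA, 7),
  X2011 = c(NA, 3, 8)
)

# is.na() on a dataframe returns a logical matrix of the same dimensions
na.mask <- is.na(toy)
print(na.mask)

# Work on a copy so that the original values are preserved
toy.replace.na <- toy
toy.replace.na[na.mask] <- 0

print(toy.replace.na)
```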

In this chapter, I won't use the dataframe with replaced NA values which was just created, so this is just for demonstration purposes. A more effective way to handle NA values, when possible, is to replace the missing value with an estimate based on existing data.

Imputation of missing values

A good guess for the missing values is the mean value of the non-NA values in the same row (in a particular region), since the total length of road doesn't change all that much year to year.

In the following continuation of r_intro.R, the row means in indices corresponding to NA values in 2011 are extracted from the roads.means3 vector. The extracted row means are then assigned to the indices of the roads.2011.3 vector which correspond to NA values:

roads.2011.3 <- roads3$X2011
roads.2011.3[is.na(roads.2011.3)] <- roads.means3[is.na(roads.2011.3)]
print(sum(roads.2011.3))
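
The same steps can be traced by hand on a tiny made-up dataframe. In this sketch the row means are computed with rowMeans(), and all names (toy, toy.means, toy.2011, missing) are assumptions for the example:

```r
# Hypothetical toy data: the first region is missing its 2011 value
toy <- data.frame(
  X2009 = c(10, 4),
  X2010 = c(12, 6),
  X2011 = c(NA, 8)
)

# Mean of the non-NA values in each row: 11 for row 1, 6 for row 2
toy.means <- rowMeans(toy, na.rm = TRUE)

# Copy the 2011 column and record which entries are missing
toy.2011 <- toy$X2011
missing <- is.na(toy.2011)

# Fill each missing entry with the mean of its own row
toy.2011[missing] <- toy.means[missing]

print(sum(toy.2011))  # 11 + 8 = 19
```

Storing the logical index in a variable first is a stylistic choice; it makes it easier to see that both sides of the assignment use the same positions.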

This results in a much better estimate of the total road length as of 2011. It is possible to go even further, however, and get a similar estimate for each column.

There are a number of ways to approach the task of getting an estimate for each column. The approach I will take is to go column by column and replace each of the NA values in that column with the corresponding mean value for the corresponding row.

This can be done with the apply() function, which can apply a function to each column of a dataframe. Before using the apply() function, you will need to create the function that is applied to each column. Functions in R work similarly to functions in Python, but have a different syntax. The following is the syntax for a function in R:

my.function <- function(<arguments>){
    <code block>
    return(result)
}
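
As a concrete instance of this syntax, here is a short sketch of a function that counts the NA values in a vector. The name count.na is made up for this example:

```r
# A hypothetical helper that counts the NA values in a vector
count.na <- function(x) {
    n <- sum(is.na(x))
    return(n)
}

print(count.na(c(1, NA, 3, NA)))  # 2
```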

The following is a function called impute() which takes two arguments: a vector that is a column of a dataframe, and a vector of equal length that contains the imputation values for each row. The impute function returns the original dataframe column where the NA values have been replaced with the corresponding imputation values:

impute <- function(x,imputations) {
    x[is.na(x)] <- imputations[is.na(x)]
    return(x)
}
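
Before using impute() on a whole dataframe, it can help to try it on a single vector. The following sketch repeats the definition so that it runs on its own; the vectors column and fallback are made-up inputs:

```r
impute <- function(x, imputations) {
    x[is.na(x)] <- imputations[is.na(x)]
    return(x)
}

# Hypothetical inputs: a column with two NA values and a fallback for each row
column <- c(5, NA, 7, NA)
fallback <- c(1, 2, 3, 4)

# Only positions 2 and 4 are replaced; the existing values are left alone
result <- impute(column, fallback)
print(result)  # 5 2 7 4
```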

The apply() function takes as its first argument a dataframe, and its third argument a function. The second argument to the apply() function is a 1 if the function should be applied to each row, or a 2 if the function should be applied to each column. After the third argument, all additional arguments to the apply() function are passed into the function which is specified in the third argument. The apply() function returns a data type called a matrix, so the result will need to be converted back to a dataframe using the data.frame() function.
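
The following sketch illustrates the second argument and the pass-through of additional arguments on a small made-up dataframe. The names toy, row.max, col.max, scaled, and multiplier are assumptions for this example:

```r
# Hypothetical toy data: two rows, three columns
toy <- data.frame(a = c(1, 2), b = c(3, 4), c = c(5, 6))

# Second argument 1: the function is applied to each row
row.max <- apply(toy, 1, max)

# Second argument 2: the function is applied to each column
col.max <- apply(toy, 2, max)

# Arguments after the function are passed along to it (here, multiplier)
scaled <- apply(toy, 2, function(x, multiplier) x * multiplier, multiplier = 10)

print(row.max)
print(col.max)
print(is.matrix(scaled))  # TRUE: the result is a matrix, not a dataframe
```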

In the following continuation of r_intro.R, the apply() function is used to go column by column, run the impute function on each column, and return a result with the imputed values:

## apply the impute function to each column with apply()
roads.impute.na <- data.frame(
 apply(roads.num3,2,impute,imputations=roads.means3)
)
print(colSums(roads.impute.na))

The resulting dataframe, roads.impute.na, now contains imputed values in place of the NA values. Printing the column sums with the colSums() function should reveal the estimated total road length for each year in the console output.
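
The whole procedure can be checked end to end on a small made-up dataframe where the answer is easy to work out by hand. This is only a sketch; toy, toy.means, and toy.imputed are hypothetical names, and the row means are computed here with rowMeans():

```r
impute <- function(x, imputations) {
    x[is.na(x)] <- imputations[is.na(x)]
    return(x)
}

# Hypothetical toy data: three regions, three years, three missing values
toy <- data.frame(
  X2009 = c(10, NA, 4),
  X2010 = c(12, 5, 6),
  X2011 = c(NA, 7, 8)
)

# Row means of the non-NA values: 11, 6, 6
toy.means <- rowMeans(toy, na.rm = TRUE)

# Impute column by column, then convert the matrix back to a dataframe
toy.imputed <- data.frame(
  apply(toy, 2, impute, imputations = toy.means)
)

print(colSums(toy.imputed))  # X2009 = 20, X2010 = 23, X2011 = 26
```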