Packt+ | Advance your knowledge in tech

You're reading from R Data Analysis Cookbook, Second Edition Customizable R Recipes for data mining, data visualization and time series analysis

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781787124479

Length 560 pages

Edition 2nd Edition

Languages

Tools

MongoDB

Concepts

Data Analysis

Authors (3):

Kuntal Ganguly

Viswanathan

Viswa Viswanathan

View More author details

Table of Contents (20) Chapters

Title Page

Credits

About the Author

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

1. Acquire and Prepare the Ingredients - Your Data FREE CHAPTER

2. What's in There - Exploratory Data Analysis

3. Where Does It Belong? Classification

4. Give Me a Number - Regression

5. Can you Simplify That? Data Reduction Techniques

6. Lessons from History - Time Series Analysis

7. How does it look? - Advanced data visualization

8. This may also interest you - Building Recommendations

9. It's All About Your Connections - Social Network Analysis

10. Put Your Best Foot Forward - Document and Present Your Analysis

11. Work Smarter, Not Harder - Efficient and Elegant R Code

12. Where in the World? Geospatial Analysis

13. Playing Nice - Connecting to Other Systems

Correcting data

In practice, raw data is rarely tidy, and is much harder to work with as a result. It is often said that 80 percent of data analysis is spent on the process of cleaning and correcting the data.

In this recipe, you will learn the best way to correctly layout your data to serve two major purposes:

Making data suitable for software processing, whether that be mathematical functions, visualization, and others
Revealing information and insights

Getting ready

Download the files for this chapter and store the USArrests.csv file in your R working directory. You should also install the tidyr package using the following command.

> install.packages("tidyr")
> library(tidyr)
> crimeData <- read.csv("USArrests.csv",stringsAsFactors = FALSE)

How to do it...

Follow these steps to correct the data from your dataset.

View some records of the dataset:

> View(crimeData)

Add the column name state in the dataset:

> crimeData <- cbind(state = rownames(crimeData), crimeData)

Gather all the variables between Murder and UrbanPop:

> crimeData.1 <- gather(crimeData,
key = "crime_type",
value = "arrest_estimate",
Murder:UrbanPop)

 > crimeData.1

Gather all the columns except the column state:

> crimeData.2 <- gather(crimeData,
key = "crime_type",
value = "arrest_estimate",
-state)

 > crimeData.2

Gather only the Murder and Assault columns:

> crimeData.3 <- gather(crimeData,
key = "crime_type",
value = "arrest_estimate",
Murder, Assault)

 > crimeData.3

Spread crimeData.2 to turn a pair of key:value (crime_typ:arrest_estimate) columns into a set of tidy columns

> crimeData.4 <- spread(crimeData.2,
key = "crime_type",
value = "arrest_estimate"
)
 > crimeData.4

How it works...

Correct data format is crucial for facilitating the tasks of data analysis, including data manipulation, modeling, and visualization. The tidy data arranges values so that the relationships in the data parallel the structure of the data frame. Every tidy dataset is based on two basic principles:

Each variable is saved in its own column
Each observation is saved in its own row

In the crimeData dataframe, the row names were states, hence we used the function cbind() to add a column named state in the dataframe. The function gather() collapses multiple columns into key-value pairs. It makes wide data longer. The gather() function basically takes four arguments, data (dataframe), key (column name representing new variable), value (column name representing variable values), and names of the columns to gather (or not gather).

In the crimeData.2 data, all column names (except state) were collapsed into a single key column crime_type and their values were put into a value column arrest_estimate.

And, in the crimeData.3 data, the two columns Murder and Assault were collapsed and the remaining columns (state, UrbanPop, and Rape) were duplicated.

The function spread() does the reverse of gather(). It takes two columns (key and value) and spreads them into multiple columns. It makes long data wider. The spread() function takes three arguments in general, data (dataframe), key (column values to convert to multiple columns), and value (single column value to convert to multiple columns' values).

There's more...

Beside the spread() and gather() functions, there are two more important functions in the tidyr package that help to make data tidy.

Combining multiple columns to single columns

The unite() function takes multiple columns and pastes them together into one column:

> crimeData.5 <- unite(crimeData,
                  col = "Murder_Assault",
                  Murder, Assault,
                  sep = "_")
 > crimeData.5

We combine the columns Murder and Assault from the crimeData data-frame to generate a new column Murder_Assault, having the values separated by _.

Splitting single column to multiple columns

The separate() function is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns:

> crimeData.6 <- separate_(crimeData.5,
         col = "Murder_Assault",
         into = c("Murder", "Assault"),
         sep = "_")
>crimeData.6