Correcting data
In practice, raw data is rarely tidy, and is much harder to work with as a result. It is often said that 80 percent of data analysis is spent on the process of cleaning and correcting the data.
In this recipe, you will learn the best way to correctly layout your data to serve two major purposes:
- Making data suitable for software processing, whether that be mathematical functions, visualization, and others
- Revealing information and insights
Getting ready
Download the files for this chapter and store the USArrests.csv
file in your R working directory. You should also install the tidyr
package using the following command.
> install.packages("tidyr") > library(tidyr) > crimeData <- read.csv("USArrests.csv",stringsAsFactors = FALSE)
How to do it...
Follow these steps to correct the data from your dataset.
- View some records of the dataset:
> View(crimeData)
- Add the column name
state
in the dataset:
> crimeData <- cbind(state = rownames(crimeData), crimeData)
- Gather all the variables between
Murder
andUrbanPop
:
> crimeData.1 <- gather(crimeData, key = "crime_type", value = "arrest_estimate", Murder:UrbanPop) > crimeData.1
- Gather all the columns except the column
state
:
> crimeData.2 <- gather(crimeData, key = "crime_type", value = "arrest_estimate", -state) > crimeData.2
- Gather only the
Murder
andAssault
columns:
> crimeData.3 <- gather(crimeData, key = "crime_type", value = "arrest_estimate", Murder, Assault) > crimeData.3
- Spread
crimeData.2
to turn a pair ofkey
:value
(crime_typ
:arrest_estimate
) columns into a set of tidy columns
> crimeData.4 <- spread(crimeData.2, key = "crime_type", value = "arrest_estimate" ) > crimeData.4
How it works...
Correct data format is crucial for facilitating the tasks of data analysis, including data manipulation, modeling, and visualization. The tidy data arranges values so that the relationships in the data parallel the structure of the data frame. Every tidy dataset is based on two basic principles:
- Each variable is saved in its own column
- Each observation is saved in its own row
In the crimeData
dataframe, the row names were states
, hence we used the function cbind()
to add a column named state
in the dataframe. The function gather()
collapses multiple columns into key-value pairs. It makes wide data longer. The gather()
function basically takes four arguments, data (dataframe), key (column name representing new variable), value (column name representing variable values), and names of the columns to gather (or not gather).
In the crimeData.2
data, all column names (except state
) were collapsed into a single key column crime_type
and their values were put into a value column arrest_estimate
.
And, in the crimeData.3
data, the two columns Murder
and Assault
were collapsed and the remaining columns (state
, UrbanPop
, and Rape
) were
duplicated.
The function spread()
does the reverse of gather()
. It takes two columns (key and value) and spreads them into multiple columns. It makes long data wider. The spread()
function takes three arguments in general, data (dataframe), key (column values to convert to multiple columns), and value (single column value to convert to multiple columns' values).
There's more...
Beside the spread()
and gather()
functions, there are two more important functions in the tidyr
package that help to make data tidy.
Combining multiple columns to single columns
The unite()
function takes multiple columns and pastes them together into one column:
> crimeData.5 <- unite(crimeData, col = "Murder_Assault", Murder, Assault, sep = "_") > crimeData.5
We combine the columns Murder
and Assault
from the crimeData
data-frame to generate a new column Murder_Assault
, having the values separated by _
.
Splitting single column to multiple columns
The separate()
function is the reverse of unite()
. It takes values inside a single character column and separates them into multiple columns:
> crimeData.6 <- separate_(crimeData.5, col = "Murder_Assault", into = c("Murder", "Assault"), sep = "_") >crimeData.6