Creating dummies for categorical variables
When we have categorical variables (factors) but want to use them in analytical methods that require numeric input (for example, k-nearest neighbors (KNN) or linear regression), we need to create dummy variables.
Getting ready
Download the data-conversion.csv file and store it in the working directory of your R environment. Install the dummies package. Then read the data:
> install.packages("dummies")
> library(dummies)
> students <- read.csv("data-conversion.csv")
How to do it...
Create dummies for all factors in the data frame:
> students.new <- dummy.data.frame(students, sep = ".")
> names(students.new)
[1] "Age"      "State.NJ" "State.NY" "State.TX" "State.VA"
[6] "Gender.F" "Gender.M" "Height"   "Income"
The students.new data frame now contains the original non-factor variables and the newly added dummy variables, which replace the State and Gender factors. The dummy.data.frame() function has created dummy variables for all four levels of State and both levels of Gender. However, when we use machine learning techniques, we will generally omit one of the dummy variables for State and one for Gender, because the omitted dummy's value is fully determined by the remaining ones.
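One way to obtain dummies with one level omitted per factor is base R's model.matrix(), sketched below. This is an alternative to the dummies package, not part of the recipe itself. The students_demo data frame and its values are made up for illustration and do not come from data-conversion.csv.

```r
# Hypothetical stand-in data; the values are made up for illustration
students_demo <- data.frame(
  Age    = c(23, 13, 36, 31),
  State  = factor(c("NJ", "NY", "TX", "VA")),
  Gender = factor(c("F", "M", "M", "F"))
)

# model.matrix() uses treatment contrasts by default, so the first level
# of each factor (State "NJ", Gender "F") becomes the baseline and gets
# no column of its own
mm <- model.matrix(~ State + Gender, data = students_demo)

# Drop the intercept column to keep only the dummy variables
dummies_k_minus_1 <- mm[, -1, drop = FALSE]
colnames(dummies_k_minus_1)
```

Here State contributes three columns and Gender one. A row with zeros in all State columns corresponds to the baseline level NJ.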
We can use the optional argument all = FALSE to specify that the resulting data frame should contain only the generated dummy variables and none of the original variables.
How it works...
The dummy.data.frame() function creates dummies for all the factors in the data frame supplied. Internally, it uses another function, dummy(), which creates dummy variables for a single factor. The dummy() function creates one new variable for every level of the factor for which we are creating dummies. It generates the name of each dummy variable by appending the factor level name to the variable name. We can use the sep argument to specify the character that separates the two; an empty string is the default:
> dummy(students$State, sep = ".")
      State.NJ State.NY State.TX State.VA
 [1,]        1        0        0        0
 [2,]        0        1        0        0
 [3,]        1        0        0        0
 [4,]        0        0        0        1
 [5,]        0        1        0        0
 [6,]        0        0        1        0
 [7,]        1        0        0        0
 [8,]        0        0        0        1
 [9,]        0        0        1        0
[10,]        0        0        0        1
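The idea of one column per level, named from the variable name, the separator, and the level, can be sketched in a few lines of base R. The make_dummies() helper below is hypothetical and only illustrates the technique; it is not the dummies package's actual implementation.

```r
# make_dummies() is a hypothetical helper, not part of the dummies package
make_dummies <- function(x, name, sep = "") {
  f  <- as.factor(x)
  lv <- levels(f)
  # Compare every observation with every level; TRUE/FALSE becomes 1/0
  out <- outer(as.character(f), lv, "==") * 1L
  # Column name = variable name, separator, level name
  colnames(out) <- paste(name, lv, sep = sep)
  out
}

# Made-up values for illustration
state_dummies <- make_dummies(c("NJ", "NY", "NJ", "VA"), "State", sep = ".")
state_dummies
```

Each row of the result contains exactly one 1, in the column for that observation's level.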
There's more...
When a data frame has several factors and you plan to use only a subset of them, you can create dummies for just the chosen subset.
Choosing which variables to create dummies for
To create dummies for only one variable or a subset of variables, we can use the names argument to specify the column names of the variables we want dummies for:
> students.new1 <- dummy.data.frame(students, names = c("State", "Gender"), sep = ".")
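The effect of restricting dummy creation to named columns can be sketched in base R as follows. The expand_selected() helper and the students_demo data are hypothetical and serve only to illustrate the behavior; they are not the dummies package API or the recipe's data.

```r
# Hypothetical stand-in data; the values are made up for illustration
students_demo <- data.frame(
  Age    = c(23, 13, 36),
  State  = factor(c("NJ", "NY", "TX")),
  Gender = factor(c("F", "M", "M"))
)

# expand_selected() is a hypothetical helper, not the dummies package API
expand_selected <- function(df, names, sep = ".") {
  pieces <- lapply(colnames(df), function(col) {
    # Columns that were not requested pass through unchanged
    if (!(col %in% names)) return(df[col])
    f <- as.factor(df[[col]])
    m <- outer(as.character(f), levels(f), "==") * 1L
    colnames(m) <- paste(col, levels(f), sep = sep)
    as.data.frame(m)
  })
  # Reassemble the pieces in the original column order
  do.call(cbind, pieces)
}

only_state <- expand_selected(students_demo, names = "State")
names(only_state)
```

In this sketch only State is expanded, while Gender remains a factor in the result.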