Encoding categorical variables
To recap, we have now imputed the missing values in both the categorical and quantitative columns of our dataset. At this point, you may be wondering: how do we use the categorical data with a machine learning algorithm?
Simply put, we need to transform this categorical data into numerical data. So far, we have ensured that the most common category was used to fill the missing values. Now that this is done, we need to take it a step further.
Most machine learning algorithms, whether linear regression or KNN with Euclidean distance, require numerical input features to learn from. There are several methods we can rely on to transform our categorical data into numerical data.
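To see why numerical input matters, consider a minimal sketch of the Euclidean distance that KNN relies on. The `euclidean_distance` helper and the sample rows below are hypothetical, written only for illustration; they are not taken from our dataset or from any library.

```python
import math

def euclidean_distance(a, b):
    # Square root of the summed squared differences between paired features
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two purely numerical rows: the distance is well defined
numeric_distance = euclidean_distance([1.0, 2.0], [4.0, 6.0])

# Two rows that still contain a raw category: subtraction is undefined for strings
try:
    euclidean_distance(["red", 2.0], ["blue", 6.0])
    categorical_failed = False
except TypeError:
    categorical_failed = True

print(numeric_distance)    # 5.0
print(categorical_failed)  # True
```

The second call fails because there is no meaningful way to subtract one string from another, which is why the categories must be converted into numbers first.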
Encoding at the nominal level
Let's begin with data at the nominal level. The main method we have is to transform our categorical data into dummy variables. We have two options to do this:
- Utilize pandas to automatically find the categorical variables and dummy code them
- Create our...
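The pandas option listed above can be sketched with `pd.get_dummies`. The toy DataFrame below, including its column names and values, is an assumption made for illustration and is not the dataset used in this chapter. Passing `dtype=int` is a choice made here so that the dummy columns show 0 and 1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data: one nominal column and one quantitative column
df = pd.DataFrame({
    "city": ["tokyo", "london", "tokyo", "seattle"],
    "rating": [3, 5, 4, 1],
})

# With no `columns` argument, get_dummies encodes the string/categorical
# columns it finds and leaves the numerical columns untouched
dummies = pd.get_dummies(df, dtype=int)

print(dummies)
```

Each category in `city` becomes its own 0/1 column (`city_london`, `city_seattle`, `city_tokyo`), while `rating` passes through unchanged.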