Final preprocessing steps
Now that we have gone through all of the variable groups, we are almost ready to build our predictive models. First, however, we must expand each of our categorical variables into a set of binary variables (a process known as one-hot encoding, or a 1-of-K representation) and convert our data into a format suitable for input into the scikit-learn methods. Let's do that next.
One-hot encoding
Many classifiers in the scikit-learn library require categorical variables to be one-hot encoded. In one-hot encoding, or a 1-of-K representation, a categorical variable that has more than two possible values is recoded as multiple binary variables, one for each possible value.
For example, let's say that we have five patients in our dataset and we wish to one-hot encode a column that encodes the primary visit diagnosis. Before one-hot encoding, the column looks like this:
Patient | Primary visit diagnosis |
1 | copd |
2 | hypertension |
3 | copd |
4 | chf |
5 | asthma |
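To make the example concrete, here is a minimal sketch of how the five-patient column above might be one-hot encoded with the pandas `get_dummies` function. The DataFrame name `visits`, the column names `patient_id` and `primary_dx`, and the `dx` prefix are assumptions chosen for illustration; they are not taken from the dataset used in this chapter.

```python
import pandas as pd

# Hypothetical column names; the five rows mirror the example table above.
visits = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "primary_dx": ["copd", "hypertension", "copd", "chf", "asthma"],
})

# get_dummies replaces primary_dx with one 0/1 indicator column
# per distinct diagnosis value, named with the given prefix.
encoded = pd.get_dummies(visits, columns=["primary_dx"], prefix="dx", dtype=int)

print(encoded)
```

Each patient row ends up with a 1 in exactly one of the new diagnosis columns and a 0 in the others. scikit-learn also offers its own `OneHotEncoder` transformer, which may be preferable when the same encoding has to be reapplied to new data.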
After one-hot encoding, this column would be split into K columns, where...