Working with categorical variables
Categorical variables are a problem. On one hand, they provide valuable information; on the other hand, that information is usually text: either the actual text or integers that correspond to the text, such as an index in a lookup table.
So, we clearly need to represent our text as integers for the model's sake, but we can't just use the id field or represent the categories naively. This is because we need to avoid a problem similar to the one in the Creating binary features through thresholding recipe: if we hand the model data that looks continuous, it will be interpreted as continuous, and arbitrary ids will be read as having an order and a magnitude.
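To make the concern concrete, here is a minimal sketch in plain Python of one common alternative to raw ids: giving each category its own 0/1 indicator column (often called one-hot encoding). The `one_hot` helper and the color labels are hypothetical, chosen only for illustration; they are not part of the recipe.

```python
def one_hot(labels):
    """Return the sorted categories and one 0/1 indicator row per label."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1  # mark only the column for this label
        rows.append(row)
    return categories, rows


# Hypothetical labels. As lookup ids they might be 0, 1, 2, which would
# wrongly suggest that one color is "twice" another.
labels = ["red", "green", "blue", "red"]
categories, encoded = one_hot(labels)

print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each row contains exactly one 1, so no category is numerically "larger" than another.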
Getting ready
The Boston dataset won't be useful for this section. It works well for feature binarization, but it isn't a good fit for creating features from categorical variables. For this, we'll use the iris dataset instead.
For this to work, the problem needs to be turned on its head. Imagine a problem where the goal is to predict the sepal width; in this case, the species of the flower will probably be useful as a feature.
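The setup just described might be sketched as follows, assuming scikit-learn and NumPy are installed. The variable names and the choice of `OneHotEncoder` are assumptions made for illustration, not necessarily the approach the recipe goes on to use.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import OneHotEncoder

iris = datasets.load_iris()

# Columns of iris.data: sepal length, sepal width, petal length, petal width.
# Sepal width (column 1) becomes the target we want to predict.
y = iris.data[:, 1]

# The remaining three measurements stay as ordinary numeric features.
X_numeric = np.delete(iris.data, 1, axis=1)

# The species is stored as integer codes 0, 1, 2; the encoder expects a 2D input.
species = iris.target.reshape(-1, 1)

# Encode each species as its own indicator column instead of a single id.
encoder = OneHotEncoder()
species_onehot = encoder.fit_transform(species).toarray()

# Final feature matrix: 3 numeric columns plus 3 species indicator columns.
X = np.hstack([X_numeric, species_onehot])
print(X.shape)  # (150, 6)
```

Here the species, which is normally the target in iris examples, is used as an input feature, which is what turning the problem on its head amounts to.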