In this chapter, we will use a simple decision tree classifier. It can be trained with scikit-learn using the following:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=123, min_samples_leaf=10)
clf.fit(X_train, y_train)
However, if you run this code on our current dataset, you will likely receive an error: we have a couple of rows with missing information, and in scikit-learn versions before 1.3 a decision tree cannot handle NaN (missing) values.
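Before fitting, it can help to confirm where the missing values actually are. The following is only a minimal sketch: `X_demo` is a hypothetical toy array standing in for our feature matrix, and it assumes the features are stored as a float NumPy array (for a pandas DataFrame, `isna()` plays the same role as `np.isnan`).

```python
import numpy as np

# Hypothetical toy feature matrix standing in for X_train; NaN marks a missing value.
X_demo = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
    [5.0, 6.0],
])

# Boolean mask of the missing entries.
missing_mask = np.isnan(X_demo)

# Number of missing values in each feature (column).
missing_per_feature = missing_mask.sum(axis=0)

# Number of rows containing at least one missing value.
rows_with_missing = int(missing_mask.any(axis=1).sum())

print(missing_per_feature)
print(rows_with_missing)
```

In this toy case each feature has one missing value, and two rows are affected.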
To fill these NaN values, we will use a SimpleImputer transformer, which replaces each NaN with the mean value of the corresponding feature. Following the scikit-learn API, we first need to fit the transformer on our training sample:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
imp.fit(X_train)
We then need to apply the transformation to both our training and test samples:
X_train = imp.transform(X_train)
X_test = imp.transform(X_test)
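To make the fit/transform split concrete, here is a small sketch on made-up data. The arrays `X_train_demo` and `X_test_demo` are hypothetical and not part of our dataset. The point it illustrates is that the means are learned from the training sample only and then reused unchanged to fill the test sample.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data; NaN marks a missing value.
X_train_demo = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
])
X_test_demo = np.array([
    [np.nan, np.nan],
    [7.0, 8.0],
])

# Fit on the training sample only: this computes one mean per feature,
# ignoring the NaN entries.
imp_demo = SimpleImputer(strategy='mean')
imp_demo.fit(X_train_demo)

# The learned per-feature means are stored in the statistics_ attribute.
learned_means = imp_demo.statistics_

# Apply the same learned means to both samples.
X_train_filled = imp_demo.transform(X_train_demo)
X_test_filled = imp_demo.transform(X_test_demo)

print(learned_means)
print(X_test_filled)
```

Here the learned means are 2.0 and 3.0, so the all-NaN test row becomes `[2.0, 3.0]` even though the test sample's own values (7.0 and 8.0) would suggest different means. Keeping test-set statistics out of the preprocessing step is what keeps the evaluation honest.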
Once the data has...