Managing categorical data
In many classification problems, the target dataset is made up of categorical labels that not every algorithm can process directly. An encoding is needed, and scikit-learn offers at least two valid options. Let's consider a very small dataset made of 10 samples with 2 numerical features each, where every sample carries a categorical label:
import numpy as np
X = np.random.uniform(0.0, 1.0, size=(10, 2))
Y = np.random.choice(('Male', 'Female'), size=(10))
print(X[0])
[0.8236887  0.11975305]
print(Y[0])
Male
The first option is to use the LabelEncoder class, which adopts a dictionary-oriented approach: it associates each category label with a progressive integer, that is, an index into an instance array called classes_:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
yt = le.fit_transform(Y)
print(yt)
[0 0 0 1 0 1 1 0 0 1]
le.classes_
array(['Female', 'Male'], dtype='<U6')
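To make the label-to-integer association explicit, the following minimal sketch rebuilds it as a plain dictionary. The fixed `labels` array and the `mapping` name are illustrative assumptions (not part of the original example), chosen so the result does not depend on a random draw:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fixed labels (an illustrative assumption) so the result is reproducible
labels = np.array(['Male', 'Female', 'Female', 'Male', 'Female'])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ stores the distinct labels in sorted order;
# the integer assigned to a label is its position in that array
mapping = {str(label): int(index) for index, label in enumerate(le.classes_)}

print(mapping)           # {'Female': 0, 'Male': 1}
print(encoded.tolist())  # [1, 0, 0, 1, 0]
```

Because the classes are sorted, the integer codes follow the alphabetical order of the labels rather than their order of appearance in the data.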
The inverse transformation can be obtained in this simple way:
output = [1, 0, 1, 1, 0, 0]
decoded_output...
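As a self-contained sketch of the inverse step, the snippet below fits an encoder on the two labels used in this section and decodes a list of integer codes with `inverse_transform`. The `predicted` and `decoded` names are hypothetical, and the integer list simply stands in for the output of some classifier:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['Male', 'Female'])  # classes_ becomes ['Female', 'Male']

# Hypothetical integer predictions, e.g. produced by a classifier
predicted = [1, 0, 1, 1, 0, 0]

# Each integer is looked up as an index into classes_
decoded = le.inverse_transform(predicted)

print(decoded.tolist())
# ['Male', 'Female', 'Male', 'Male', 'Female', 'Female']
```

The decoding only works for integers that are valid indices into `classes_`; any other value raises an error, so the same fitted encoder should be kept around for both directions.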