Managing categorical data
In many classification problems, the target dataset is made up of categorical labels that cannot immediately be processed by every algorithm. An encoding is needed, and scikit-learn offers at least two valid options. Let's consider a very small dataset made of 10 categorical samples with 2 features each:
import numpy as np

X = np.random.uniform(0.0, 1.0, size=(10, 2))
Y = np.random.choice(('Male', 'Female'), size=(10))

X[0]
array([ 0.8236887 ,  0.11975305])

Y[0]
'Male'
The first option is to use the LabelEncoder class, which adopts a dictionary-oriented approach, associating a progressive integer with each category label; each integer is an index into an instance array called classes_:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
yt = le.fit_transform(Y)

print(yt)
[0 0 0 1 0 1 1 0 0 1]

le.classes_
array(['Female', 'Male'], dtype='|S6')
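Because each encoded integer is just an index into classes_, indexing that array with the encoded values recovers the original labels. A minimal sketch (using a small fixed label array rather than the random Y above, so the result is reproducible):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform sorts the unique labels, so 'Female' -> 0 and 'Male' -> 1
yt = le.fit_transform(np.array(['Male', 'Female', 'Female', 'Male']))

print(yt)               # [1 0 0 1]
print(le.classes_[yt])  # ['Male' 'Female' 'Female' 'Male']
```

Note that the mapping is determined by sorted order of the unique labels, not by order of first appearance.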
The inverse transformation can be obtained in this simple way:
output = [1, 0, 1, 1, 0, 0]
decoded_output...
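The decoding step above is truncated; a plausible completion uses LabelEncoder's inverse_transform method, which maps each integer back to its label (a self-contained sketch, fitting the encoder on the two labels first):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(np.array(['Male', 'Female']))  # classes_ becomes ['Female', 'Male']

output = [1, 0, 1, 1, 0, 0]
# inverse_transform maps integers back through classes_: 0 -> 'Female', 1 -> 'Male'
decoded_output = le.inverse_transform(output)

print(decoded_output)  # ['Male' 'Female' 'Male' 'Male' 'Female' 'Female']
```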