As we discussed in Chapter 7, Cleaning Messy Data, feature encoding is a very useful technique for handling categorical features. Dask also offers encoders that can execute in parallel, and they mirror most of the encoders that scikit-learn offers:
Encoder | Description |
LabelEncoder | Encodes labels with an integer value between 0 and n_classes - 1, where n_classes is the number of distinct classes. |
OneHotEncoder | Encodes categorical integer features as a one-hot encoding. |
OrdinalEncoder | Encodes a categorical column as an ordinal variable. |
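To make the table concrete, here is a minimal plain-Python sketch of what label encoding and one-hot encoding do to a small column. It does not use Dask or scikit-learn and is not their implementation; the category values are hypothetical, and the only behavior borrowed from scikit-learn's LabelEncoder is that the classes are sorted before integer codes are assigned.

```python
# Plain-Python sketch of label encoding and one-hot encoding.
# The category values are hypothetical examples, not read from the dataset.
values = ["low", "medium", "high", "low", "medium"]

# Label encoding: sort the distinct classes, then map each one
# to an integer between 0 and n_classes - 1.
classes = sorted(set(values))
label_of = {cls: idx for idx, cls in enumerate(classes)}
labels = [label_of[v] for v in values]

# One-hot encoding: one 0/1 column per class, with a single 1 in each row.
one_hot = [[1 if v == cls else 0 for cls in classes] for v in values]

print(classes)     # ['high', 'low', 'medium']
print(labels)      # [1, 2, 0, 1, 2]
print(one_hot[0])  # [0, 1, 0]
```

Note that the integer codes follow the sorted (alphabetical) order of the classes, not any natural ranking such as low < medium < high. When the order of the categories matters, the categories have to be specified explicitly rather than inferred.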
Let's try using these methods:
# Import Dask DataFrame
import dask.dataframe as dd
# Read CSV file
ddf = dd.read_csv('HR_comma_sep.csv')
# See top 5 records
ddf.head(5)
This results in the following output:
In the preceding code, we used the read_csv() function to read the human resources CSV file into a Dask DataFrame. The preceding output shows only some of the available columns. However, when you run the notebook...