As we discussed in Chapter 7, Cleaning Messy Data, feature scaling, also known as feature normalization, is used to scale the features at the same level. It can handle issues regarding different column ranges and units. Dask also offers scaling methods that have parallel execution capacity. It uses most of the methods that scikit-learn offers:
Scaler | Description |
MinMaxScaler | Transforms features by scaling each feature to a given range |
RobustScaler | Scales features using statistics that are robust to outliers |
StandardScaler | Standardizes features by removing the mean and scaling them to unit variance |
Let's scale the last_evaluation (employee performance score) column of the human resource dataset:
# Import Dask DataFrame
import dask.dataframe as dd
# Read CSV file
ddf = dd.read_csv('HR_comma_sep.csv')
# See top 5 records
ddf.head(5)
This results in the following output:
In the preceding code, we read the human resource CSV file using...