To perform parallel computing with scikit-learn on a single machine, we use joblib, which parallelizes Python jobs across the cores of a CPU and is what makes scikit-learn operations run in parallel. Dask can take this further and help us perform parallel operations on multiple scikit-learn estimators. Let's take a look:
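Before diving into the dataset, here is a minimal sketch of how joblib parallelizes Python jobs across cores (the `square` function and the inputs are illustrative, not part of the chapter's dataset):

```python
# A minimal joblib sketch: run a function over inputs on multiple workers.
# joblib is installed alongside scikit-learn.
from joblib import Parallel, delayed

def square(x):
    return x * x

# n_jobs=2 asks joblib to use two parallel workers
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(5))
print(results)  # [0, 1, 4, 9, 16]
```

This same mechanism is what scikit-learn uses internally whenever an estimator accepts an `n_jobs` parameter.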
- First, we need to read the dataset. We can load it into a pandas DataFrame, like so:
# Import pandas
import pandas as pd
# Read CSV file
df = pd.read_csv('HR_comma_sep.csv')
# See top 5 records
df.head(5)
This results in the following output:
In the preceding code, we read the human resources CSV file into a pandas DataFrame using the read_csv() function. The preceding output only shows some of the columns that are available; when you run the notebook for yourself, you will be able to see all the columns in the dataset. Now, let's scale the last_evaluation column (the last evaluated performance score).
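As a quick illustration of what such scaling does (the chapter's own scaling step follows), min-max scaling maps a column linearly into the range [0, 1]. The toy values below are a stand-in, since the real column comes from HR_comma_sep.csv:

```python
import pandas as pd

# Toy stand-in for the last_evaluation column of the HR dataset
df = pd.DataFrame({'last_evaluation': [0.53, 0.86, 0.88, 0.87, 0.52]})

# Min-max scaling: (x - min) / (max - min) maps values into [0, 1]
col = df['last_evaluation']
df['last_evaluation_scaled'] = (col - col.min()) / (col.max() - col.min())
print(df)
```

After scaling, the smallest value in the column becomes 0 and the largest becomes 1, which puts features with different ranges on a common footing.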
- Next...