The developers of Dask have also reimplemented various k-means clustering algorithms. Let's perform clustering using Dask:
- Read the human resource data into a Dask DataFrame, as follows:
# Read CSV file using Dask
import dask.dataframe as dd
# Read Human Resource Data
ddf = dd.read_csv("HR_comma_sep.csv")
# Let's see top 5 records
ddf.head()
This results in the following output:
In the preceding code, we read the human resource CSV file using the read_csv() function into a Dask DataFrame. The preceding output only shows some of the columns that are available. However, when you run the notebook for yourself, you will be able to see all the columns in the dataset. Now, let's scale the last_evalaution column (last evaluated performance score).
- Next, select the required column for k-means clustering. We have selected the satisfaction_level and last_evaluation columns here:
data=ddf[['satisfaction_level', 'last_evaluation']].to_dask_array...