Chapter 1: Introduction to Clustering
Activity 1: Implementing k-means Clustering
Solution:
- Load the Iris data file using pandas, a package that makes data wrangling much easier through the use of DataFrames:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cdist

iris = pd.read_csv('iris_data.csv', header=None)
iris.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'species']
- Separate out the X features and the provided y species labels, since we want to treat this as an unsupervised learning problem:
X = iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = iris['species']
- Get an idea of what our features look like:
X.head()
The output is as follows:
Figure 1.22: First five rows of the data
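The loading and splitting steps above can be tried without the data file on disk. The following is a minimal sketch, assuming the CSV has no header row and five comma-separated fields per line. The four rows and the names sample_csv, sample, X_sample, and y_sample are illustrative stand-ins rather than part of the activity, and the species label strings are an assumption about how the file encodes them.

```python
import io

import pandas as pd

# Hypothetical stand-in for iris_data.csv: four rows, no header line.
# The label strings are an assumption about the file's encoding.
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm',
           'PetalWidthCm', 'species']

# header=None tells pandas that the first line is data, not column names;
# names= assigns the column labels in the same call.
sample = pd.read_csv(sample_csv, header=None, names=columns)

# Features go in X_sample; the labels are set aside in y_sample so that
# the clustering step never sees them.
X_sample = sample[columns[:4]]
y_sample = sample['species']

print(X_sample.head())
print(X_sample.shape)
```

Passing names= to read_csv is equivalent to assigning iris.columns afterward, as the solution does. Keeping the labels in a separate variable makes it possible to compare them against the cluster assignments later without letting them influence the clustering itself.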
- Bring back the...