scikit-learn toy datasets
scikit-learn provides some built-in datasets that are useful for prototyping because they train quickly and offer different levels of complexity. They're all available in the sklearn.datasets package and share a common structure: the data instance variable contains the whole input set X, while target contains the labels for classification or the target values for regression. For example, for the Boston house pricing dataset (used for regression), we have the following:
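    # Note: load_boston was removed in scikit-learn 1.2, so this needs an older release
    from sklearn.datasets import load_boston

    boston = load_boston()

    X = boston.data      # input samples
    Y = boston.target    # regression targets

    print(X.shape)
    # (506, 13)

    print(Y.shape)
    # (506,)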
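The same data/target layout can be checked on a dataset that ships with current scikit-learn releases. The following is a minimal sketch, assuming load_diabetes as a stand-in regression set (it is not the dataset used in the text):

```python
from sklearn.datasets import load_diabetes

# load_diabetes is an illustrative stand-in, chosen only because it is a
# built-in regression dataset; the text itself uses the Boston data.
diabetes = load_diabetes()

X = diabetes.data      # input matrix, one row per sample
Y = diabetes.target    # one regression target per sample

print(X.shape)  # (442, 10)
print(Y.shape)  # (442,)
```

Only the shapes differ; the Boston arrays printed earlier follow the same layout.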
In this case, we have 506 samples with 13 features and a single target value. In this book, we're going to use it for regression, and the handwritten digit dataset (load_digits(), a small set of 8×8-pixel images, not the full MNIST) for classification tasks. scikit-learn also provides functions for creating dummy datasets from scratch: make_classification(...