scikit-learn toy datasets
scikit-learn provides some built-in datasets that are useful for prototyping because they train quickly and offer different levels of complexity. They're all available in the sklearn.datasets package and share a common structure: the data attribute contains the whole input set X, while the target attribute contains the labels for classification or the target values for regression. For example, considering the Boston house pricing dataset (used for regression), we have the following:
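# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this snippet requires an earlier release.
from sklearn.datasets import load_boston

boston = load_boston()
X = boston.data
Y = boston.target

print(X.shape)
(506, 13)

print(Y.shape)
(506,)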
In this case, we have 506 samples with 13 features and a single target value. In this book, we're going to use it for regression and the handwritten digit dataset (load_digits(), a small set of 8 × 8 digit images rather than the full MNIST) for classification tasks. scikit-learn also provides functions for creating dummy datasets from scratch: make_classification(...
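As a minimal sketch of the two ideas above, the following snippet loads the digits dataset through the same data/target structure and then generates a dummy classification set. The variable names and every parameter value passed to make_classification (sample count, feature counts, random seed) are illustrative assumptions, not settings taken from the text.

```python
from sklearn.datasets import load_digits, make_classification

# Built-in dataset: same data/target structure as the other loaders
digits = load_digits()
X_digits = digits.data      # input samples, one flattened 8x8 image per row
Y_digits = digits.target    # class labels (digits 0-9)
print(X_digits.shape, Y_digits.shape)

# Dummy dataset generated from scratch.
# All values below are arbitrary choices made for illustration only.
X_syn, Y_syn = make_classification(
    n_samples=200,      # number of generated samples
    n_features=10,      # total number of features
    n_informative=4,    # features that actually carry class information
    n_redundant=2,      # linear combinations of the informative features
    n_classes=2,        # binary problem
    random_state=1000,  # fixed seed so the output is reproducible
)
print(X_syn.shape, Y_syn.shape)
```

Fixing random_state keeps the generated samples identical across runs, which is convenient when comparing models on the same dummy data.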