Balanced cross-validation
While splitting a dataset into folds, you might wonder: couldn't the sets in each fold of k-fold cross-validation turn out very different from one another? The target distribution could vary considerably from fold to fold, and those differences can lead to volatility in the scores.
The solution is stratified cross-validation: each subset of the dataset looks like a smaller version of the whole dataset, at least with respect to the target variable.
Getting ready
Create a toy dataset as follows:
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8],
              [1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 1, 1, 1, 2, 2, 2, 2])
How to do it...
- If we perform 4-fold cross-validation on this miniature toy dataset, each of the four testing folds will contain only one value of the target. This can be remedied using StratifiedKFold (a short comparison with plain KFold follows the snippet below):
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=4)
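To see why this helps, here is a minimal sketch, not part of the original recipe, that contrasts the two splitters on the X and y defined above: with plain KFold every test fold contains a single class, while StratifiedKFold keeps both classes in each fold.

from sklearn.model_selection import KFold, StratifiedKFold

# Plain KFold splits the 8 samples in order, so each test fold
# holds samples of a single class.
for _, test_index in KFold(n_splits=4).split(X):
    print("KFold test classes:", y[test_index])

# StratifiedKFold preserves the class proportions, so each test
# fold holds one sample of each class.
for _, test_index in StratifiedKFold(n_splits=4).split(X, y):
    print("StratifiedKFold test classes:", y[test_index])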
- Print out the indices of the folds:
cc = 1
for train_index, test_index in skf.split(X, y):
    print("Fold", cc, "- train indices:", train_index, "test indices:", test_index)
    cc += 1
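The splitter can also be passed directly as the cv argument of cross_val_score. The sketch below assumes a LogisticRegression classifier purely as a placeholder estimator; any scikit-learn classifier would work.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder estimator; swap in whichever classifier you are evaluating.
clf = LogisticRegression()

# Every one of the 4 stratified test folds contains both target classes,
# so each score is computed on a balanced test set.
scores = cross_val_score(clf, X, y, cv=skf)
print(scores)

Note that for classifiers, cross_val_score already uses stratified folds by default when cv is an integer; passing the StratifiedKFold object explicitly simply makes that choice visible.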