Reducing overfitting with cross-validation
Here, we will use cross-validation on the diabetes dataset from the previous recipe to reduce overfitting and improve performance on held-out data. Start by loading the dataset, as in the previous recipe:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
X_feature_names = ['age', 'gender', 'body mass index', 'average blood pressure', 'bl_0', 'bl_1', 'bl_2', 'bl_3', 'bl_4', 'bl_5']

bins = 50*np.arange(8)
binned_y = np.digitize(y, bins)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=binned_y)
How to do it...
- Use grid search to reduce overfitting. Import a decision tree and instantiate it:
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
- Then, import GridSearchCV and instantiate this class:
from sklearn.model_selection import GridSearchCV...
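The steps so far can be tied together in a minimal, self-contained sketch. It is not the recipe's own continuation: the parameter grid (max_depth values of 3, 5, and 7), the cv=5 setting, the mean absolute error scoring, and the fixed random_state values are all assumptions chosen for illustration. It first fits an unconstrained tree to show the gap between training and test scores, then lets a grid search pick a depth by cross-validation on the training set only.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Same stratified split as in the recipe; random_state is an assumption
# added here only so the sketch is repeatable.
binned_y = np.digitize(y, 50 * np.arange(8))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=binned_y, random_state=0
)

# Baseline: a fully grown tree tends to memorize the training data.
baseline = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
train_r2 = baseline.score(X_train, y_train)
test_r2 = baseline.score(X_test, y_test)
print(f"unconstrained tree: train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")

# Hypothetical grid: these depth values are illustrative, not the recipe's.
param_grid = {"max_depth": [3, 5, 7]}

gs = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    cv=5,                                # assumed fold count
    scoring="neg_mean_absolute_error",   # assumed metric; higher is better
)
gs.fit(X_train, y_train)

print("best parameters:", gs.best_params_)
print("cross-validated MAE:", -gs.best_score_)
```

Because the search only ever sees X_train and y_train, the test set remains available for a final, unbiased check of the selected tree, for example through gs.best_estimator_.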