First, we split the dataset into training and test sets, preserving the class distribution in both:
from sklearn.model_selection import train_test_split
X = df[["score"]]
y = df.label
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    # make sure both the train and test sets are representative
    # of the whole dataset in terms of class imbalance
    stratify=y,
)
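A quick way to confirm that stratification did its job is to compare the class proportions of each subset. The helper below is a minimal sketch using only the standard library; `class_proportions` and the toy label lists are hypothetical stand-ins for illustration, not part of the original code.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: fraction of samples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical toy labels standing in for y, y_train and y_test:
# 10% positives overall, split 80/20 with the same ratio on each side.
y_all = [1] * 10 + [0] * 90
y_train_toy = [1] * 8 + [0] * 72
y_test_toy = [1] * 2 + [0] * 18

for name, labels in [("all", y_all), ("train", y_train_toy), ("test", y_test_toy)]:
    print(name, class_proportions(labels))
```

Applied to the real split, `class_proportions(y_train)` and `class_proportions(y_test)` should come out nearly identical when `stratify=y` is set.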
As we noticed earlier, our dataset is imbalanced. We can restore class balance in the training set by randomly undersampling the majority class:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=SEED)
X_train, y_train = rus.fit_resample(X_train, y_train)
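To make the idea behind random undersampling concrete, here is a minimal standard-library sketch: every class is cut down to the size of the smallest one by sampling without replacement. It illustrates the principle only and is not imbalanced-learn's implementation; `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def random_undersample(rows, labels, seed=42):
    """Shrink every class to the size of the smallest class (sampling without replacement)."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # The minority class size is the target size for all classes.
    n_min = min(len(idxs) for idxs in by_class.values())

    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_min))
    kept.sort()  # keep the original ordering of the surviving samples

    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: 7 negatives, 3 positives.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

scores_bal, labels_bal = random_undersample(scores, labels)
print(labels_bal.count(0), labels_bal.count(1))  # 3 3
```

Note that undersampling discards majority-class samples, which is why it is applied to the training set only and the test set keeps its original distribution.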
To compute the FPR and TPR at different thresholds, we use scikit-learn's roc_curve function:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train, X_train.score)
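For intuition about what each (FPR, TPR) pair means, the computation at a single threshold can be sketched by hand: a sample is predicted positive when its score is at or above the threshold, and the rates follow from the resulting confusion counts. The function `fpr_tpr_at` and the four-sample toy data are hypothetical, and the sketch assumes binary labels 0/1 with at least one sample of each class.

```python
def fpr_tpr_at(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    # FPR = FP / (FP + TN), TPR = TP / (TP + FN)
    return fp / (fp + tn), tp / (tp + fn)


# Hypothetical toy data: two negatives, two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves along the ROC curve from (0, 0) toward (1, 1).
for t in [0.8, 0.4, 0.35, 0.1]:
    print(t, fpr_tpr_at(y_true, scores, t))
# 0.8 (0.0, 0.5)
# 0.4 (0.5, 0.5)
# 0.35 (0.5, 1.0)
# 0.1 (1.0, 1.0)
```

Sweeping the threshold over the distinct score values in this way traces out the same kind of (FPR, TPR) points that roc_curve returns.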
To plot...