Preprocessing the predictor variables
Let's take a look at specific groups of predictor variables that commonly pop up in healthcare data.
Visit information
The first feature category in the ED2013 dataset contains information about the timing of the visit. Variables such as month, day of week, and arrival time are included here. Also included are the waiting time and length of visit variables (both in minutes).
Month
Let's analyze the VMONTH predictor in more detail. The following code prints all the values in the training set and their counts:
print(X_train.groupby('VMONTH').size())
The output is as follows:
VMONTH
01    1757
02    1396
03    1409
04    1719
05    2032
06    1749
07    1696
08    1034
09    1240
10    1306
11    1693
12    1551
dtype: int64
We can now see that the months are numbered from 01 to 12, as the documentation states, and that every month is represented in the training set.
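Rather than reading the counts by eye, the same check can be done programmatically. The following is a minimal sketch, not the book's own code: toy_train is a hypothetical stand-in for X_train with made-up values, and it assumes VMONTH is stored as two-digit strings, as the printed output suggests.

```python
import pandas as pd

# Hypothetical toy stand-in for X_train; these values are made up for illustration.
toy_train = pd.DataFrame({"VMONTH": ["01", "02", "02", "03", "05", "12"]})

# Count rows per month, similar in spirit to groupby('VMONTH').size().
month_counts = toy_train["VMONTH"].value_counts().sort_index()

# Assumed coding: two-digit strings from 01 to 12.
expected_months = [f"{m:02d}" for m in range(1, 13)]

# Months in the expected coding that never occur in the data.
missing_months = [m for m in expected_months if m not in month_counts.index]

# Values in the data that fall outside the expected coding (blanks, typos, and so on).
unexpected_values = sorted(set(month_counts.index) - set(expected_months))

print(month_counts)
print("Missing months:", missing_months)
print("Unexpected values:", unexpected_values)
```

On the real training set, both lists should come back empty if every month is represented and no stray codes are present.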
One part of preprocessing the data is performing feature engineering – that is, combining or transforming the...