Exploration – data analysis
Now, it is time to explore the data. There are many questions that we can ask, such as the following:
- Which target features would we like to model to support our goals?
- What are the useful training features for each target feature?
- Which features are not good for modeling since they leak information about target features (see the previous section)?
- Which features are not useful (for example, constant features, or features containing a lot of missing values)?
- How should we clean up the data? What should we do with missing values? Can we engineer new features?
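As a rough illustration of the "not useful" question above, the following minimal sketch flags constant columns and columns with many missing values. It is plain Python rather than H2O or Spark code, and the function name profile_columns, the 0.5 threshold, and the toy column names are all assumptions made for this example, not part of the dataset or of either tool's API.

```python
from typing import Dict, List, Optional

# A row maps column names to raw string values; None marks a missing value.
Row = Dict[str, Optional[str]]


def profile_columns(rows: List[Row], max_missing: float = 0.5) -> Dict[str, List[str]]:
    """Flag columns that are constant or mostly missing (hypothetical helper)."""
    report: Dict[str, List[str]] = {"constant": [], "mostly_missing": []}
    if not rows:
        return report
    for col in rows[0].keys():
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing.
        present = [v for v in values if v is not None and v != ""]
        missing_ratio = 1 - len(present) / len(values)
        if missing_ratio > max_missing:
            report["mostly_missing"].append(col)
        elif len(set(present)) <= 1:
            # At most one distinct observed value: no signal for a model.
            report["constant"].append(col)
    return report


# Hypothetical toy data; the column names are invented for illustration.
sample = [
    {"id": "1", "code": "A", "note": None, "amount": "1000"},
    {"id": "2", "code": "A", "note": None, "amount": "2500"},
    {"id": "3", "code": "A", "note": "x", "amount": None},
]
report = profile_columns(sample)
```

Here the code column would be reported as constant and the note column as mostly missing. The threshold for "too many" missing values is a judgment call that depends on the data and the modeling goal.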
Basic clean up
During data exploration, we will also perform basic data clean up. In our case, we can use the power of both tools together: we use the H2O Flow UI to explore the data and find suspicious parts of it, and then transform them directly with H2O or, even better, with Spark.
Useless columns
The first step is to remove columns that contain a unique value for each line. Typical examples of this are user IDs or transaction IDs. In our...