The columns of a dataset contain its "natural" features: the data characteristics that are present in the initial dataset, before any feature engineering.
At this point, it is important to make sure that the definition of each feature is clear. For instance, if the dataset contains a column reporting a product price, does this price include VAT? If the price is set to 0, does it really mean the product was free, or is 0 the default value used when the person or system filling in the data doesn't know the real value? All these questions need to be answered, and answering them involves a lot of communication with the dataset owner.
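A quick count can help prepare that conversation with the dataset owner. The following is a minimal sketch in plain Python, assuming a hypothetical list of records with a `price` column; the data, the column name, and the helper `count_placeholder_candidates` are invented for illustration. It does not decide whether 0 means "free" or "unknown"; it only shows how many rows the question applies to.

```python
# Hypothetical records; products and prices are invented for illustration.
records = [
    {"product": "A", "price": 12.5},
    {"product": "B", "price": 0.0},
    {"product": "C", "price": 0.0},
    {"product": "D", "price": 7.0},
    {"product": "E", "price": None},
]


def count_placeholder_candidates(rows, column):
    """Count values that might be placeholders rather than real measurements."""
    # A value equal to 0 could be a genuine "free" price or a default value.
    zeros = sum(1 for row in rows if row.get(column) == 0)
    # A missing value (None) is counted separately from an explicit 0.
    missing = sum(1 for row in rows if row.get(column) is None)
    return {"rows": len(rows), "zeros": zeros, "missing": missing}


summary = count_placeholder_candidates(records, "price")
print(summary)
```

If a large share of the rows has a price of 0, that is a strong hint that the value is worth discussing with the dataset owner before it is used as a real price.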
The column definition is not the only information that needs to be described well. Before going further, you also have to characterize each feature. A feature can be one of two types:
- Numerical feature: A feature whose value is an integer (number of floors in a house) or a floating-point number (its surface area). If the feature represents a physical quantity...