At this point, the main question we must address is is this data enough to answer the problem? We might ask, are there important features missing? And can we take into account another dataset to add more information to this data?
For instance, if your problem is about predicting house prices and your data contains the house address as typed by the user, the address could be written as 5th Avenue, Fifth avenue, or even Av. 5. In this case, a step of normalization might be necessary so that all addresses have the same format and common addresses can be identified. In addition, it is likely that the location of the address, written in terms of latitude and longitude, is important in order to compute the distance, for example. This would mean that a geocoding step would be necessary.
At this point, you can also check the open data pages that are relevant to your problem. Consider the following:
- States and countries maintain websites to list all publicly available...