Data Discretization
So far, we have treated categorical data using encoding and numerical data using scaling.
Data discretization is the process of converting continuous data into discrete buckets by grouping nearby values together. Discretized data is also easier to maintain. Training a model on discrete data can be faster, and sometimes more effective, than training on the same data in continuous form. Although continuous-valued data contains more information, a huge number of distinct values can slow the model down, and discretization helps us strike a balance between the two. Two well-known methods of data discretization are binning and histogram analysis.
The main challenge in discretization is choosing the number of intervals (bins) and deciding on their width.
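To make the effect of the bin count concrete, here is a minimal sketch of equal-width binning with NumPy's histogram function. The sample values and the choice of four bins are assumptions made purely for illustration; a different bin count would give different edges and counts.

```python
import numpy as np

# Hypothetical sample of a continuous feature (for example, ages).
ages = np.array([22, 25, 31, 38, 45, 52, 58, 62])

# Split the range [min, max] into 4 equal-width bins.
# `counts` holds how many values fall in each bin, `edges` the bin boundaries.
counts, edges = np.histogram(ages, bins=4)

# With equal-width binning, every bin has the same width.
width = edges[1] - edges[0]

print("edges :", edges)
print("counts:", counts)
print("width :", width)
```

Here the range 22 to 62 is divided into four bins of width 10. Fewer bins would merge more values together and lose detail, while more bins would keep detail at the cost of sparser buckets.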
Here we make use of a function called pandas.cut(). This function is useful to...
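As a rough illustration of pandas.cut(), the sketch below bins a small series in two ways: with explicit bin edges and labels, and with a requested number of equal-width bins. The data, the bin edges, and the label names are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical ages to discretize.
ages = pd.Series([5, 17, 25, 42, 68, 80])

# Option 1: explicit bin edges with a label for each interval.
# By default the intervals are closed on the right: (0, 18], (18, 35], ...
bins = [0, 18, 35, 60, 100]
labels = ["child", "young adult", "adult", "senior"]
age_group = pd.cut(ages, bins=bins, labels=labels)

# Option 2: ask for 3 equal-width bins and get integer bin codes back.
age_code = pd.cut(ages, bins=3, labels=False)

print(pd.DataFrame({"age": ages, "group": age_group, "code": age_code}))
```

Explicit edges are handy when the buckets have a domain meaning, while passing an integer lets pandas derive equal-width edges from the minimum and maximum of the data.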