Discretizing continuous variables
Some algorithms, for example, the Naive Bayes algorithm that will be introduced in Chapter 8, use discrete input variables only. If you want to use a continuous variable in your analysis, you have to discretize it, or bin the values. You might also want to discretize a continuous variable just to be able to show its distribution with a bar chart. There are many possible ways to do the discretization. I will show the following ones:
- Equal width binning
- Equal height binning
- Custom binning
Equal width discretization
Equal width binning is probably the most popular way of doing discretization. This means that after the binning, all bins have equal width, or represent an equal range of the original variable values, no matter how many cases are in each bin. With enough bins, you can preserve the original distribution quite well, and represent it with a bar chart.
I will start with a T-SQL example. I will bin the Age
variable from the dbo.vTargetMail
view in five groups...