Outliers appear in a dataset for two main reasons:
- Human error: If the values are typed in by a person, mistakes are likely from time to time. They might type an extra zero at the end of a number, or transpose two digits so that we end up with a price of $91 instead of $19 for some products.
- Rare observation: Although almost all of your products cost less than $100, you may have a few more expensive ones, priced at $1,000 or more. Modeling both usual and rare events is often complicated. Therefore, if rare events are not of particular importance to you, it is better to leave them out of the model.
Sometimes, the outliers are actually the anomalies you are trying to identify, for example in fraud detection or network intrusion detection. Several methods exist to identify outliers and deal with them; some are very simple and some more sophisticated.
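To make the idea of a "very simple" method concrete, here is a minimal sketch of one common rule of thumb, the interquartile range (IQR) rule, which flags any value lying more than 1.5 times the IQR below the first quartile or above the third. This is only an illustration and not necessarily the method used in our example; the function name `iqr_outliers`, the parameter `k`, and the `prices` list are hypothetical and chosen for this sketch.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; k=1.5 is a conventional default.
    """
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up prices: mostly under $100, plus one rare expensive product.
prices = [12, 15, 19, 22, 25, 30, 45, 60, 91, 1000]
print(iqr_outliers(prices))  # [1000]
```

With these made-up prices, the rule flags the $1,000 product but not the $91 one: a typo that stays within the usual price range cannot be caught by a purely statistical rule like this.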
In our example, we will use an oversimplified method consisting of replacing values...