Outliers in Machine Learning
For a normally distributed variable, approximately 68% of the data points lie within +/- 1 standard deviation of the mean, 95% lie within +/- 2 standard deviations, and 99.7% lie within +/- 3 standard deviations.
Z-score and outliers: If the z-score of a data point is greater than 3, the data point is quite different from the other data points and can be an outlier. For example, suppose a survey asks each respondent how many children they have; a response far above the typical answers would have a z-score above 3 and would stand out as an outlier.
We might have outliers because of data entry or human errors, damaged or poorly calibrated measurement instruments, data manipulation, dummy values inserted to test detection methods or to add noise, and finally genuine novelties in the data.
Even when you generate random numbers from a distribution (e.g. Gaussian), a few rare values will land far away from the mean of all the other examples. These are the ones we want to get rid of (or, in the real world, analyse to understand why they are there).
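As a quick illustration (a minimal sketch in NumPy; the sample size and seed are arbitrary choices, not from the original text), drawing from a standard Gaussian already produces a small fraction of points beyond three standard deviations:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Roughly 0.3% of Gaussian draws are expected beyond 3 sigma (the 99.7% rule).
extreme = samples[np.abs(samples) > 3]
print(f"{len(extreme)} of {len(samples)} samples lie beyond 3 sigma "
      f"({100 * len(extreme) / len(samples):.2f}%)")
```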
1. Hypothesis testing
If the calculated test statistic is greater than the critical value, you can reject the null hypothesis and conclude that one of the values is an outlier.
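The original text does not name the test, but the description matches a Grubbs-style hypothesis test; assuming that, a minimal sketch could look like this (the data below is hypothetical):

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test: flags the most extreme value as an outlier
    if the calculated statistic exceeds the critical value."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    g_calculated = np.max(np.abs(x - x.mean())) / x.std(ddof=1)

    # Critical value derived from the t-distribution (standard Grubbs' formula).
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_critical = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

    return g_calculated, g_critical, g_calculated > g_critical

# Hypothetical data with one suspiciously large value.
data = [2, 3, 3, 4, 2, 3, 4, 3, 2, 35]
g, g_crit, is_outlier = grubbs_test(data)
print(f"G = {g:.2f}, critical value = {g_crit:.2f}, outlier present: {is_outlier}")
```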
2. Z-score method
Using the Z-score method, we can find out how many standard deviations a value lies away from the mean. If the z-score of a data point is greater than 3 (because +/- 3 standard deviations covers 99.7% of the area under the curve), the value is quite different from the other values and is treated as an outlier.
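A minimal sketch of the Z-score method (the survey-style data below is hypothetical and only illustrates the calculation):

```python
import numpy as np

# Hypothetical "number of children" survey responses, with one extreme value.
children = np.array([1, 2, 2, 0, 3, 1, 2, 4, 2, 1, 0, 3, 2, 1, 15])

mean, std = children.mean(), children.std()
z_scores = (children - mean) / std

# Flag values whose absolute z-score exceeds 3.
outliers = children[np.abs(z_scores) > 3]
print("z-scores:", np.round(z_scores, 2))
print("outliers:", outliers)
```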
3. Robust Z-score
It is also called the median absolute deviation (MAD) method. It is similar to the Z-score method, but with different parameters: since the mean and standard deviation are heavily influenced by outliers, we replace them with the median and the absolute deviation from the median.
Suppose x follows a standard normal distribution. The MAD then converges to the median of the half-normal distribution, which is the 75th percentile of the standard normal distribution, roughly 0.6745. The robust z-score is therefore 0.6745 * (x - median) / MAD.
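A minimal sketch of the robust z-score on the same kind of hypothetical data (0.6745 being the 75th-percentile constant mentioned above):

```python
import numpy as np

# Hypothetical data with one extreme value.
x = np.array([1, 2, 2, 0, 3, 1, 2, 4, 2, 1, 0, 3, 2, 1, 15], dtype=float)

median = np.median(x)
mad = np.median(np.abs(x - median))   # median absolute deviation

# Scaling by 0.6745 makes the robust z-score comparable to an ordinary z-score
# when the data is roughly normal.
robust_z = 0.6745 * (x - median) / mad

outliers = x[np.abs(robust_z) > 3]
print("robust z-scores:", np.round(robust_z, 2))
print("outliers:", outliers)
```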
4. Five-Number Summary
1. Minimum
2. First Quartile (Q1)
3. Median
4. Third Quartile (Q3)
5. Maximum
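The five-number summary leads to the usual IQR-based fences; a minimal sketch (hypothetical data, using the conventional 1.5 * IQR rule, which the list above does not spell out):

```python
import numpy as np

# Hypothetical data with one extreme value.
x = np.array([1, 2, 2, 0, 3, 1, 2, 4, 2, 1, 0, 3, 2, 1, 15], dtype=float)

minimum, q1, median, q3, maximum = np.percentile(x, [0, 25, 50, 75, 100])
iqr = q3 - q1

# Conventional Tukey fences: 1.5 * IQR beyond the first and third quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = x[(x < lower_fence) | (x > upper_fence)]
print("five-number summary:", minimum, q1, median, q3, maximum)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```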
5. Isolation Forest
It is an anomaly-detection algorithm that belongs to the ensemble-of-decision-trees family and is similar in principle to Random Forest.
1. It classifies each data point as an outlier or not an outlier, and works well with very high-dimensional data.
2. It is based on decision trees: anomalous points are easier to isolate, so they end up with shorter paths in the trees.
3. If the result is -1, the specific data point is an outlier; if the result is 1, the data point is not an outlier.
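A minimal sketch using scikit-learn's IsolationForest (the data and the contamination setting below are illustrative assumptions, not from the original text):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical 2-D data: a tight Gaussian cluster plus a few far-away points.
rng = np.random.default_rng(seed=0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalies = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -10.0]])
X = np.vstack([inliers, anomalies])

# contamination is the expected fraction of outliers (an assumption here).
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = model.fit_predict(X)   # -1 for outliers, 1 for inliers

print("points labelled as outliers:")
print(X[labels == -1])
```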