Example of an outlier box plot
The data set of N = 90 ordered observations shown below is examined for outliers:
30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322, 336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448, 451, 453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527, 548, 550, 559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618, 621, 629, 637, 638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792, 792, 794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925, 953, 991, 1000, 1005, 1068, 1441
The above data is available as a text file.
The computations are as follows:
- Median = .5[N+1]th ordered point = 45.5th ordered point = the average of the 45th and 46th ordered points = [559 + 560]/2 = 559.5
- Lower quartile = .25[N+1]th ordered point = 22.75th ordered point = 411 + .75[436-411] = 429.75
- Upper quartile = .75[N+1]th ordered point = 68.25th ordered point = 739 +.25[752-739] = 742.25
- Interquartile range = 742.25 - 429.75 = 312.5
- Lower inner fence = 429.75 - 1.5 [312.5] = -39.0
- Upper inner fence = 742.25 + 1.5 [312.5] = 1211.0
- Lower outer fence = 429.75 - 3.0 [312.5] = -507.75
- Upper outer fence = 742.25 + 3.0 [312.5] = 1679.75
From an examination of the fence points and the data, one point [1441] exceeds the upper inner fence and stands out as a mild outlier; there are no extreme outliers.
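The computations above can be reproduced with a short script. This is a minimal sketch in plain Python, assuming the same [N+1] positioning rule with linear interpolation between neighboring ordered points; the helper name `ordered_point` and the variable names are illustrative choices, not part of the original example.

```python
# Ordered observations from the example (N = 90).
data = [
    30, 171, 184, 201, 212, 250, 265, 270, 272, 289,
    305, 306, 322, 322, 336, 346, 351, 370, 390, 404,
    409, 411, 436, 437, 439, 441, 444, 448, 451, 453,
    470, 480, 482, 487, 494, 495, 499, 503, 514, 521,
    522, 527, 548, 550, 559, 560, 570, 572, 574, 578,
    585, 592, 592, 607, 616, 618, 621, 629, 637, 638,
    640, 656, 668, 707, 709, 719, 737, 739, 752, 758,
    766, 792, 792, 794, 802, 818, 830, 832, 843, 858,
    860, 869, 918, 925, 953, 991, 1000, 1005, 1068, 1441,
]


def ordered_point(sorted_values, position):
    """Return the value at a 1-based, possibly fractional, position.

    Fractional positions are resolved by linear interpolation between
    the two neighboring ordered points (hypothetical helper).
    """
    lower_index = int(position)            # whole part of the position
    fraction = position - lower_index      # fractional part
    lower_value = sorted_values[lower_index - 1]
    if fraction == 0:
        return lower_value
    upper_value = sorted_values[lower_index]
    return lower_value + fraction * (upper_value - lower_value)


n = len(data)
median = ordered_point(data, 0.50 * (n + 1))
lower_quartile = ordered_point(data, 0.25 * (n + 1))
upper_quartile = ordered_point(data, 0.75 * (n + 1))
iqr = upper_quartile - lower_quartile

# Fences: 1.5 IQR (inner) and 3.0 IQR (outer) beyond the quartiles.
inner_fences = (lower_quartile - 1.5 * iqr, upper_quartile + 1.5 * iqr)
outer_fences = (lower_quartile - 3.0 * iqr, upper_quartile + 3.0 * iqr)

# Mild outliers lie beyond an inner fence but not beyond an outer fence;
# extreme outliers lie beyond an outer fence.
extreme = [x for x in data if x < outer_fences[0] or x > outer_fences[1]]
mild = [
    x for x in data
    if (x < inner_fences[0] or x > inner_fences[1]) and x not in extreme
]

print("median:", median)
print("quartiles:", lower_quartile, upper_quartile, "IQR:", iqr)
print("inner fences:", inner_fences)
print("outer fences:", outer_fences)
print("mild outliers:", mild, "extreme outliers:", extreme)
```

Other quartile conventions (for example, the defaults in many statistics packages) interpolate at slightly different positions, so the fence values can shift a little depending on the tool.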
Should you drop outliers?
Outliers are one of those statistical issues that everyone knows about, but most people aren’t sure how to deal with. Most parametric statistics, like means, standard deviations, and correlations, and every statistic based on these, are highly sensitive to outliers. And since the assumptions of common statistical procedures, like linear regression and ANOVA, are also based on these statistics, outliers can really mess up your analysis.
Despite all this, as much as you’d like to, it is NOT acceptable to drop an observation just because it is an outlier. Outliers can be legitimate observations and are sometimes the most interesting ones. It’s important to investigate the nature of the outlier before deciding.
Dropping is justified when the outlier is clearly due to incorrectly entered or measured data. For example, I once analyzed a data set in which a woman’s weight was recorded as 19 lbs. I knew that was physically impossible. Her true weight was probably 91, 119, or 190 lbs, but since I didn’t know which one, I dropped the outlier. This also applies to a situation in which you know the datum did not accurately measure what you intended. For example, if you are testing people’s reaction times to an event, but you see that a participant is not paying attention and is randomly hitting the response key, you know it is not an accurate measurement.
Neither the presence nor absence of the outlier in the graph below would change the regression line:
In the following graph, the relationship between X and Y is clearly created by the outlier. Without it, there is no relationship between X and Y, so the regression coefficient does not truly describe the effect of X on Y.
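The two situations described above can be sketched numerically with a small least-squares fit. The numbers here are invented purely for illustration and are not the data behind the graphs; `slope` is a hypothetical helper that computes the ordinary least-squares slope by hand.

```python
def slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Made-up base data with no linear relationship between X and Y.
base = [(1, 3.0), (2, 1.0), (3, 2.0), (4, 1.0), (5, 3.0)]

# Case 1: an extreme X value that falls on the existing (flat) trend.
on_trend_outlier = (20, 2.0)

# Case 2: an extreme point that by itself creates an apparent trend.
influential_outlier = (20, 20.0)

slope_base = slope(base)
slope_on_trend = slope(base + [on_trend_outlier])
slope_influential = slope(base + [influential_outlier])

print("slope without outlier:      ", round(slope_base, 3))
print("slope with on-trend outlier:", round(slope_on_trend, 3))
print("slope with influential one: ", round(slope_influential, 3))
```

In this toy setup the first outlier leaves the slope essentially unchanged, while the second one alone produces a clearly positive slope, so comparing the fit with and without the point is a quick way to see which situation you are in.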
So in those cases where you shouldn’t drop the outlier, what do you do?
One option is to try a transformation. Square root and log transformations both pull in high numbers. This can bring the data closer to meeting the model's assumptions if the outlier is in the dependent variable, and it can reduce the impact of a single point if the outlier is in an independent variable.
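As a rough illustration of how these transformations pull in high values, the sketch below reuses the 90 observations from the box plot example and measures how far the largest value sits above the upper quartile, in interquartile-range units, before and after transforming. The function names `ordered_point` and `upper_tail_distance` are hypothetical, and the quartiles assume the same [N+1] interpolation rule used in that example.

```python
import math

# Ordered observations from the box plot example (N = 90).
data = [
    30, 171, 184, 201, 212, 250, 265, 270, 272, 289,
    305, 306, 322, 322, 336, 346, 351, 370, 390, 404,
    409, 411, 436, 437, 439, 441, 444, 448, 451, 453,
    470, 480, 482, 487, 494, 495, 499, 503, 514, 521,
    522, 527, 548, 550, 559, 560, 570, 572, 574, 578,
    585, 592, 592, 607, 616, 618, 621, 629, 637, 638,
    640, 656, 668, 707, 709, 719, 737, 739, 752, 758,
    766, 792, 792, 794, 802, 818, 830, 832, 843, 858,
    860, 869, 918, 925, 953, 991, 1000, 1005, 1068, 1441,
]


def ordered_point(sorted_values, position):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    lower_index = int(position)
    fraction = position - lower_index
    lower_value = sorted_values[lower_index - 1]
    if fraction == 0:
        return lower_value
    return lower_value + fraction * (sorted_values[lower_index] - lower_value)


def upper_tail_distance(values):
    """Distance of the maximum above the upper quartile, in IQR units."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = ordered_point(ordered, 0.25 * (n + 1))
    q3 = ordered_point(ordered, 0.75 * (n + 1))
    return (ordered[-1] - q3) / (q3 - q1)


raw_distance = upper_tail_distance(data)
sqrt_distance = upper_tail_distance([math.sqrt(x) for x in data])
log_distance = upper_tail_distance([math.log(x) for x in data])

print("raw :", round(raw_distance, 3))
print("sqrt:", round(sqrt_distance, 3))
print("log :", round(log_distance, 3))
```

A distance above 1.5 means the largest value lies beyond the upper inner fence. Because the same transformations stretch the low end of the scale, it is worth rechecking the lower fences after transforming.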
Another option is to try a different model. This should be done with caution, but it may be that a non-linear model fits better. For example, in example 3, perhaps an exponential curve fits the data with the outlier intact.
Whichever approach you take, you need to know your data and your research area well. Try different approaches, and see which ones make theoretical sense.