How can I identify the labels of outliers in an R boxplot?

static_rtti · Jun 21, 2012

The R boxplot function is a very useful way to look at data: it quickly provides you with a visual summary of the approximate location and variance of your data, and the number of outliers. In addition, I'd like to identify the outliers, in order to quickly find problems in the dataset.

The values of these outliers can be accessed using myplot$out. Unfortunately, the labels of these outliers seem to be unavailable. There are some packages aimed at displaying the labels on the plot itself (http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/), but they don't work well. In any case, I just want to list these outliers; I don't need them to be on the plot itself.

Any ideas?

Answer

csgillespie · Jun 21, 2012

You've done most of the hard work yourself. All that remains is a comparison:

## First create some data
## (you should include this in your question)
set.seed(2)
dd = data.frame(x = rlnorm(26), y=LETTERS)

Grab the outliers

outliers = boxplot(dd$x, plot=FALSE)$out

Extract the outliers from the original data frame

dd[dd$x %in% outliers,]
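
If you only want the labels rather than the whole rows, the same comparison can be wrapped in a small helper. This is just a sketch: outlier_labels is a made-up name (not a base R function), and the toy data below are invented so that the result is easy to check by eye.

```r
## Hypothetical helper: return the labels whose values boxplot() flags as outliers
outlier_labels = function(values, labels) {
  ## plot = FALSE computes the boxplot statistics without drawing anything
  out = boxplot(values, plot = FALSE)$out
  labels[values %in% out]
}

## Toy data: 100 lies far beyond the upper whisker
toy = data.frame(x = c(1, 2, 3, 4, 5, 100),
                 y = c("a", "b", "c", "d", "e", "f"),
                 stringsAsFactors = FALSE)

toy_labels = outlier_labels(toy$x, toy$y)
print(toy_labels)
```

For the data frame above you would call outlier_labels(dd$x, dd$y). Note that this matches by value, so it assumes a single, ungrouped boxplot.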

Further explanation:

The variable dd$x is a vector of 26 numbers. The variable outliers contains the values of the outliers (just type dd$x and outliers in your R console to see them). The command

dd$x %in% outliers

checks, for each value of dd$x, whether it appears in outliers, and returns a logical vector, viz:

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE <snip>

The square bracket notation, dd[dd$x %in% outliers,], returns the rows of the data frame dd for which dd$x %in% outliers is TRUE.
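
To see each step in isolation, here is the same idea on a tiny vector. The numbers are made up for illustration; which() is used here as one way to turn the logical vector into row positions.

```r
## Toy vector with one extreme value (illustrative data only)
x = c(1, 2, 3, 4, 5, 100)

## Values that boxplot() treats as outliers
out = boxplot(x, plot = FALSE)$out

## Logical vector: TRUE wherever x is one of the outlier values
mask = x %in% out

## Positions of the TRUE entries, i.e. the row numbers of the outliers
pos = which(mask)

print(mask)
print(pos)
```

Either mask or pos can go inside the square brackets to pull out the matching rows.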