How does cut with breaks work in R

deepseefan picture deepseefan · Aug 24, 2016 · Viewed 24.3k times · Source

I am trying to understand how cut divides and creates intervals; tried ?cut but can't be able to figure out how cut in r works.
Here is my problem:

set.seed(111)
data1 <- seq(1,10, by=1)
data1 
[1]  1  2  3  4  5  6  7  8  9 10
data1cut<- cut(data1, breaks = c(0,1,2,3,5,7,8,10), labels = FALSE)
data1cut
[1] 1 2 3 4 4 5 5 6 7 7

1. Why did 8,9,10 not included in data1cut result?
2. why did summary(data1) and summary(data1cut) produces different result?

summary(data1)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1.00    3.25    5.50    5.50    7.75   10.00 

summary(data1cut)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1.00    3.25    4.50    4.40    5.75    7.00  

How should i better use cut so that i can create say 4 bins based on the results of summary(data1)?

bin1 [1 -3.25]
bin2 (3.25 -5.50]
bin3 (5.50 -7.75]
bin4 (7.75 -10] 

Thank you.

Answer

epo3 picture epo3 · Aug 24, 2016

cut in your example splits the vector into the following parts: 0-1 (1); 1-2 (2); 2-3 (3); 3-5 (4); 5-7 (5); 7-8 (6); 8-10 (7)

The numbers in brackets are default labels assigned by cut to each bin, based on the breaks values provided.

cut by default is exclusive of the lower range. If you want to change that then you need to specify it in the include.lowest argument.

  1. You did not assign labels and default argument in this function is FALSE so an integer vector of level codes (in brackets) is used instead.

  2. summary(data1) is a summary of raw data and summary(data1cut) is a summary of your splits.

You can get the split you need using:

data2cut<- 
  cut(data1, breaks = c(1, 3.25, 5.50, 7.75, 10),
      labels = c("1-3.25", "3.25-5.50", "5.50-7.75", "7.75-10"),
      include.lowest = TRUE)

The result is the following:

> data2cut

 [1] 1-3.25    1-3.25    1-3.25    3.25-5.50 3.25-5.50 5.50-7.75 5.50-7.75 7.75-10   7.75-10  
[10] 7.75-10  
Levels: 1-3.25 3.25-5.50 5.50-7.75 7.75-10

I hope it's clear now.