I am trying to understand how cut divides data and creates intervals. I read ?cut but still can't figure out how cut works in R.
Here is my problem:
set.seed(111)
data1 <- seq(1,10, by=1)
data1
[1] 1 2 3 4 5 6 7 8 9 10
data1cut<- cut(data1, breaks = c(0,1,2,3,5,7,8,10), labels = FALSE)
data1cut
[1] 1 2 3 4 4 5 5 6 7 7
1. Why are 8, 9 and 10 not included in the data1cut result?
2. Why do summary(data1) and summary(data1cut) produce different results?
summary(data1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.25 5.50 5.50 7.75 10.00
summary(data1cut)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.25 4.50 4.40 5.75 7.00
How should I use cut to create, say, 4 bins based on the results of summary(data1)?
bin1 [1 -3.25]
bin2 (3.25 -5.50]
bin3 (5.50 -7.75]
bin4 (7.75 -10]
Thank you.
cut in your example splits the vector into the following parts:
0-1 (1); 1-2 (2); 2-3 (3); 3-5 (4); 5-7 (5); 7-8 (6); 8-10 (7)
The numbers in brackets are the default labels assigned by cut to each bin, based on the breaks values provided.
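cut by default builds intervals that are open on the left and closed on the right, (a, b], so a value equal to the lowest break falls outside every bin and becomes NA. If you want that value included in the first bin, set include.lowest = TRUE; if you want [a, b) intervals instead, set right = FALSE.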
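A small sketch of how the boundaries are treated may help here. The vector x and the breaks brks are made up for illustration (they are not from the question); the breaks deliberately start at the minimum of x so the edge case shows up:

```r
# Illustrative data, not from the question
x <- c(1, 2, 3, 4)
brks <- c(1, 2, 4)

# Default: intervals are (1,2] and (2,4], so the value 1 falls in no bin (NA)
default_bins <- cut(x, breaks = brks)

# include.lowest = TRUE closes the first interval on the left: [1,2], (2,4]
lowest_bins <- cut(x, breaks = brks, include.lowest = TRUE)

# right = FALSE flips the closed side: [1,2), [2,4), so now the value 4 is NA
left_closed_bins <- cut(x, breaks = brks, right = FALSE)

print(default_bins)
print(lowest_bins)
print(left_closed_bins)
```

In the question's data the first break is 0, below the minimum of 1, which is why no NA appeared there.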
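You passed labels = FALSE, so instead of a factor cut returns an integer vector of level codes (the numbers in brackets above). That is why 8, 9 and 10 do not appear in data1cut: it holds the bin number of each value, not the value itself. With the default (labels = NULL) you would get a factor with interval labels such as (8,10].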
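To compare the two return types, here is a sketch that runs the question's breaks once with cut's own labelling and once with labels = FALSE, then puts both next to the original values. The names brks, as_factor and as_codes are only chosen for this example:

```r
data1 <- seq(1, 10, by = 1)
brks <- c(0, 1, 2, 3, 5, 7, 8, 10)

# No labels argument: a factor whose levels are the interval labels
as_factor <- cut(data1, breaks = brks)

# labels = FALSE: only the integer bin codes
as_codes <- cut(data1, breaks = brks, labels = FALSE)

# Side by side: each original value next to the bin it fell into
print(data.frame(value = data1, interval = as_factor, code = as_codes))
```

Reading the printed table row by row shows that 8 goes to bin 6 and that 9 and 10 go to bin 7, so no value is dropped.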
summary(data1) is a summary of the raw data, while summary(data1cut) is a summary of the bin codes (1 to 7) that those values were assigned to, so the two are not expected to match.
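One way to see what summary(data1cut) is working with is to count how many values fell into each bin code. This is only a sketch that reuses the question's data; bin_counts is a name introduced here:

```r
data1 <- seq(1, 10, by = 1)
data1cut <- cut(data1, breaks = c(0, 1, 2, 3, 5, 7, 8, 10), labels = FALSE)

# How many values landed in each bin code
bin_counts <- table(data1cut)
print(bin_counts)

# The statistics describe the codes, not the data
print(mean(data1cut))  # 4.4, the Mean shown by summary(data1cut)
print(mean(data1))     # 5.5, the Mean shown by summary(data1)
```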
You can get the split you need using:
data2cut <- cut(data1, breaks = c(1, 3.25, 5.50, 7.75, 10),
                labels = c("1-3.25", "3.25-5.50", "5.50-7.75", "7.75-10"),
                include.lowest = TRUE)
The result is the following:
> data2cut
[1] 1-3.25 1-3.25 1-3.25 3.25-5.50 3.25-5.50 5.50-7.75 5.50-7.75 7.75-10 7.75-10
[10] 7.75-10
Levels: 1-3.25 3.25-5.50 5.50-7.75 7.75-10
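If you would rather not type the break points by hand, a possible variant is to take them from quantile(), which with its default settings returns the minimum, the three quartiles and the maximum. This is a sketch, not the only way to do it; brks and data3cut are names introduced for the example:

```r
data1 <- seq(1, 10, by = 1)

# Default probs are 0, 0.25, 0.5, 0.75, 1: min, quartiles, max
brks <- quantile(data1)

# include.lowest = TRUE keeps the minimum in the first bin
data3cut <- cut(data1, breaks = brks, include.lowest = TRUE)

print(brks)
print(table(data3cut))
```

Note that cut stops with an error if two quantiles coincide, because the breaks must be unique; that can happen with data containing many repeated values.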
I hope it's clear now.