Sample n random rows per group in a dataframe

Question 1

r random dataframe sample

jalapic · May 23, 2014 · Viewed 34.2k times · Source

Answer

In versions of dplyr 0.3 and later, this works just fine:

df %>% group_by(color) %>% sample_n(size = 3)
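As a self-contained sketch of the one-liner above, the following rebuilds the sample data from the question and checks that each color contributes three rows. The `set.seed()` call and the name `sampled` are assumptions added here for repeatability and illustration; they are not part of the original answer.

```r
library(dplyr)

set.seed(42)  # assumed seed, only so the draw is repeatable

# Sample data from the question: 40 rows, 10 per color
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Draw 3 rows without replacement from each color group
sampled <- df %>% group_by(color) %>% sample_n(size = 3)

# Count the rows drawn per color
table(sampled$color)

# In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample():
# df %>% group_by(color) %>% slice_sample(n = 3)
```

Note that `sample_n()` raises an error if any group has fewer than `size` rows, unless `replace = TRUE` is set.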

Old versions of `dplyr` (version <= 0.2)

I set out to answer this using dplyr, assuming that this would work:

df %.% group_by(color) %.% sample_n(size = 3)

But it turns out that in 0.2 the sample_n.grouped_df S3 method exists but isn't registered in the NAMESPACE file, so it's never dispatched. Instead, I had to do this:

df %.% group_by(color) %.% dplyr:::sample_n.grouped_df(size = 3)
Source: local data frame [12 x 3]
Groups: color

            X1         X2  color
8   0.66152710 -0.7767473   blue
1  -0.70293752 -0.2372700   blue
2  -0.46691793 -0.4382669   blue
32 -0.47547565 -1.0179842   pink
31 -0.15254540 -0.6149726   pink
39  0.08135292 -0.2141423   pink
15  0.47721644 -1.5033192    red
16  1.26160230  1.1202527    red
12 -2.18431919  0.2370912    red
24  0.10493757  1.4065835 yellow
21 -0.03950873 -1.1582658 yellow
28 -2.15872261 -1.5499822 yellow

Presumably this will be fixed in a future update.

Question 2

From these questions ("Random sample of rows from subset of an R dataframe" and "Sample random rows in dataframe"), I can easily see how to randomly sample (select) 'n' rows from a df, or 'n' rows that originate from a specific level of a factor within a df.

Here are some sample data:

df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

df[sample(nrow(df), 3), ] #samples 3 random rows from df, without replacement.

To sample 3 random rows from just the 'pink' color, for example, using library(kimisc):

library(kimisc)
sample.rows(subset(df, color == "pink"), 3)

or by writing a custom function:

sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]
sample.df(subset(df, color == "pink"), 3)

However, I want to sample 3 (or n) random rows from each level of the factor. I.e. the new df would have 12 rows (3 from blue, 3 from red, 3 from yellow, 3 from pink). It's obviously possible to run this several times, create a new df for each color, and then bind them together, but I am looking for a simpler solution.
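For reference, the manual "sample each color, then bind" approach described above can be automated in base R with `split()`, `lapply()` and `rbind()`. This is a minimal sketch, not the accepted answer's method; the helper name `sample_per_group` and the seed are hypothetical choices made for illustration.

```r
set.seed(1)  # assumed seed, only so the draw is repeatable

# Sample data from the question: 40 rows, 10 per color
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Hypothetical helper: draw n rows without replacement from each level of `group`
sample_per_group <- function(data, group, n) {
  pieces <- split(data, data[[group]])            # one data frame per level
  picked <- lapply(pieces, function(g) {
    g[sample(nrow(g), n), , drop = FALSE]         # n random rows from this level
  })
  do.call(rbind, picked)                          # bind the pieces back together
}

newdf <- sample_per_group(df, "color", 3)

# Count the rows drawn per color
table(newdf$color)
```

This sketch assumes every group has at least `n` rows; `sample()` raises an error otherwise.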
