Group by a column and sort by another column in R

Anand picture Anand · Sep 13, 2016 · Viewed 10.2k times · Source

I am examining the imdb movie dataset in kaggle with R.

Here is a minimal repro dataset:

> movies <- data.frame(movie = as.factor(c("Movie 1", "Movie 2", "Movie 3", "Movie 4")), director = as.factor(c("Dir 1", "Dir 2", "Dir 1", "Dir 3")), director_rating =  c(1000, 2000, 1000, 3000))

> movies
    movie director director_rating
1 Movie 1    Dir 1            1000
2 Movie 2    Dir 2            2000
3 Movie 3    Dir 1            1000
4 Movie 4    Dir 3            3000

Note that every row that has the same director has the same value of rating for the director.

I want to list the directors, sorted by rating, and one row per director. The following code works:

> library(dplyr)
> movies %>% 
  group_by(director) %>%
  summarize(director_rating = mean(director_rating)) %>%
  arrange(desc(director_rating))

    # A tibble: 3 x 2
  director director_rating
    <fctr>           <dbl>
1    Dir 3            3000
2    Dir 2            2000
3    Dir 1            1000

But it seems wasteful to compute the mean when I know that all the ratings for a single director are identical. What is a more idiomatic/efficient way to do this in R?

Answer

talat picture talat · Sep 13, 2016

There's actually no need to group and summarise, since you are just looking for distinct / unique entries. A dplyr option is therefore:

select(movies, -movie) %>% 
  distinct() %>% 
  arrange(desc(director_rating))
#  director director_rating
#1    Dir 3            3000
#2    Dir 2            2000
#3    Dir 1            1000

Or in case you like to keep other columns:

distinct(movies, director, .keep_all = TRUE) %>%   # for dplyr >= 0.5.0
  arrange(desc(director_rating))
#    movie director director_rating
#1 Movie 4    Dir 3            3000
#2 Movie 2    Dir 2            2000
#3 Movie 1    Dir 1            1000