R: row-wise dplyr::mutate using function that takes a data frame row and returns an integer

user3375672 · May 30, 2017 · Viewed 15.5k times

I am trying to use a custom function inside a piped mutate statement. I looked at this somewhat similar SO post, but in vain. Say I have a data frame like this (where blob is a variable unrelated to the specific task but part of the full data):

df <- 
  data.frame(exclude=c('B','B','D'), 
             B=c(1,0,0), 
             C=c(3,4,9), 
             D=c(1,1,0), 
             blob=c('fd', 'fs', 'sa'), 
             stringsAsFactors = FALSE)

I have a function that uses the variable names to select some columns based on the value in the exclude column and, for example, calculates a sum over the variables not named in exclude (which is always a single character).

FUN <- function(df){
  sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}

When I give a single row (row 1) to FUN, I get the expected sum of C and D (those not mentioned by exclude), namely 4:

FUN(df[1,])

How do I do the same in a pipe with mutate (adding the result as a variable s)? These two attempts do not work:

df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))

UPDATE: This also does not work as intended:

df %>% rowwise(.) %>% mutate(s=FUN(.))

This works, of course, but it is not within dplyr's mutate (and pipes):

df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))

Answer

konvas · May 30, 2017

If you want to use dplyr you can do so using rowwise and your function FUN.

df %>% 
    rowwise() %>% 
    do({
        # inside do(), . is the current row
        result <- as_tibble(.)
        result$s <- FUN(result)
        result
    })

The same can be achieved using group_by instead of rowwise (as you already tried), but with do instead of mutate:

df %>% 
    group_by(1:n()) %>% 
    do({
        # inside do(), . is the current one-row group
        result <- as_tibble(.)
        result$s <- FUN(result)
        result
    })

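mutate doesn't work in this case because the dot (.) inside it refers to the whole tibble that was piped in, not to the current row or group, so it's like calling FUN(df). FUN then returns a single number, and every row of s gets that same value.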
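To make that concrete, here is a minimal sketch that reuses df and FUN from the question. The names whole and res are only illustrative. The second pipe is one possible way to stay inside mutate, by indexing the dot one row at a time; it is an assumption on my part about what would be acceptable, not something tried in the question.

```r
library(dplyr)

df <- data.frame(exclude = c('B', 'B', 'D'),
                 B = c(1, 0, 0),
                 C = c(3, 4, 9),
                 D = c(1, 1, 0),
                 blob = c('fd', 'fs', 'sa'),
                 stringsAsFactors = FALSE)

FUN <- function(df){
  sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}

# Here . is the whole data frame, so FUN(.) is a single number
# that ends up repeated in every row of s.
whole <- df %>% mutate(s = FUN(.))

# Indexing . by row number hands FUN one row at a time.
res <- df %>% mutate(s = sapply(seq_len(n()), function(i) FUN(.[i, ])))
res$s  # one sum per row: 4, 5, 9
```

This is essentially the sapply workaround from the question moved inside mutate, so it is no faster than that version.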

A much more efficient way of doing the same thing, though, is to build a logical matrix marking which columns to include in each row and then use rowSums:

cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)
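As a worked check of the matrix approach on the question's data, the sketch below builds the same include matrix and compares the result with the row-by-row sums. The as.matrix() call and the name s are my additions for clarity; the logic is otherwise the same as above.

```r
df <- data.frame(exclude = c('B', 'B', 'D'),
                 B = c(1, 0, 0),
                 C = c(3, 4, 9),
                 D = c(1, 1, 0),
                 blob = c('fd', 'fs', 'sa'),
                 stringsAsFactors = FALSE)

cols <- c('B', 'C', 'D')

# One row per data row, one column per entry of cols;
# TRUE where the column is NOT the one named in exclude.
include_mat <- outer(df$exclude, cols, `!=`)

# Multiplying by the logical matrix zeroes out the excluded column
# in each row (FALSE counts as 0, TRUE as 1).
s <- rowSums(as.matrix(df[cols]) * include_mat)
unname(s)  # 4 5 9
```

Row 1 excludes B, leaving C + D = 3 + 1 = 4; row 2 gives 4 + 1 = 5; row 3 excludes D, leaving B + C = 0 + 9 = 9.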