I have to following issue using R. In short I want to create multiple new columns in a data frame based on calculations of different column pairs in the data frame.
The data looks as follows:
df <- data.frame(a1 = c(1:5),
b1 = c(4:8),
c1 = c(10:14),
a2 = c(9:13),
b2 = c(3:7),
c2 = c(15:19))
df
a1 b1 c1 a2 b2 c2
1 4 10 9 3 15
2 5 11 10 4 16
3 6 12 11 5 17
4 7 13 12 6 18
5 8 14 13 7 19
The output is supposed to look like the following:
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 4 10 9 3 15 10 7 25
2 5 11 10 4 16 12 9 27
4 7 13 12 6 18 16 13 31
5 8 14 13 7 19 18 15 33
I can achieve this using dplyr doing some manual work in the following way:
df %>% rowwise %>% mutate(sum_a = sum(a1, a2),
sum_b = sum(b1, b2),
sum_c = sum(c1, c2)) %>%
as.data.frame()
So what is being done is: take columns with the letter "a" in it, calulate the sum rowwise, and create a new column with the sum named sum_[letter]. Repeat for columns with different letters.
This is working, however, if I have a large data set with say 300 different column pairs the manual input would be significant, since I would have to write 300 mutate calls.
I recently stumbled upon the R package "purrr" and my guess is that this would solve my problem of doing what I want in a more automated way.
In particular, I would think to be able to use purrr:map2 to which I pass two lists of column names.
Then I could calculate the sum of each matching list entry, in the form of:
map2(list1, list2, ~mutate(sum))
However, I am not able to figure out how to best approach this problem using purrr. I am rather new to using purrr, so I would really appreciate any help on this issue.
Here is one option with purrr
. We get the unique
prefix of the names
of the dataset ('nm1'), use map
(from purrr
) to loop through the unique names, select
the column that matches
the prefix value of 'nm1', add the rows using reduce
and the bind the columns (bind_cols
) with the original dataset
library(tidyverse)
nm1 <- names(df) %>%
substr(1, 1) %>%
unique
nm1 %>%
map(~ df %>%
select(matches(.x)) %>%
reduce(`+`)) %>%
set_names(paste0("sum_", nm1)) %>%
bind_cols(df, .)
# a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#1 1 4 10 9 3 15 10 7 25
#2 2 5 11 10 4 16 12 9 27
#3 3 6 12 11 5 17 14 11 29
#4 4 7 13 12 6 18 16 13 31
#5 5 8 14 13 7 19 18 15 33