R/dplyr: Using a loop to create lags and calculate cumulative sums based on column names

  • A+
Category:Languages

I want to loop through a long list of columns in a large-ish dataframe and calculate cumulative sums on the columns' lagged values. Put in other words, I'm kind of calculating how much had been "done" prior to each observation.

Toy dataframe to help make this clearer.

id = c("a", "a", "a", "b", "b") date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by="days") v1 = sample(seq(1, 20), 5) v2 = sample(seq(1, 20), 5) df = data.frame(id, date, v1, v2) 

I want it to look like

id   date         v1   v2   v1Cum   v2Cum a    2015-12-01   1    13     0       0 a    2015-12-02   7    11     1       13 a    2015-12-03   12   2      8       24 b    2015-12-04   18   6      0       0 b    2015-12-05   4    9      18      6 

So it's not a cumulative sum of v1 or v2 within the id groups, but rather a cumulative sum of each id's lagged values.

I can do this on individual columns no problem, but I can't seem to generalize it with a loop:

vars = c("v1", "v2") for (var in vars) {   lagname = paste(var, "Lag", sep="")   cumname = paste(var, "Cum", sep="")   df = arrange(df, id, date)   df = df %>%      group_by(id) %>%      mutate(!!lagname := dplyr::lag(var, n = 1, default = NA))   df[[lagname]] = ifelse(is.na(df[[lagname]]), 0, df[[lagname]])   df = df %>% group_by(id) %>% arrange(date) %>% mutate(!!cumname := cumsum(!!lagname)) } 

The problems, as I see them, are

  • the lag variable just evaluates to NA (or 0 after the ifelse()). I know I haven't quite nailed the mutate().
  • the cumulative summing is evaluating to NA

Any ideas? Thanks for the help! (I'm trying to get back into coding after a break of a couple years. My primary "language" was Stata, however, so I imagine that I'm approaching this a bit wonkily. Happy to revise this completely!)

 


If I understand you correctly, the following should work:

Reproducible sample data (with 3 variables for summing):

set.seed(123) df = data.frame(   id = c("a", "a", "a", "b", "b"),   date = seq(as.Date("2015-12-01"), as.Date("2015-12-05"), by="days"),   v1 = sample(seq(1, 20), 5),   v2 = sample(seq(1, 20), 5),   v3 = sample(seq(1, 20), 5) )  > df   id       date v1 v2 v3 1  a 2015-12-01  6  1 20 2  a 2015-12-02 15 11  9 3  a 2015-12-03  8 17 13 4  b 2015-12-04 16 10 10 5  b 2015-12-05 17  8  2 

Group by id, sort by date (in case they aren't in sequence), & mutate for all named variables between the two named ones (v1:v3 in this case):

df %>%   group_by(id) %>%   arrange(date) %>%   mutate_at(vars(v1:v3), funs(Cum = cumsum(lag(., default = 0)))) %>%   ungroup()   # A tibble: 5 x 8 # Groups: id [2]   id     date          v1    v2    v3 v1_Cum v2_Cum v3_Cum   <fctr> <date>     <int> <int> <int>  <int>  <int>  <int> 1 a      2015-12-01     6     1    20      0      0      0 2 a      2015-12-02    15    11     9      6      1     20 3 a      2015-12-03     8    17    13     21     12     29 4 b      2015-12-04    16    10    10      0      0      0 5 b      2015-12-05    17     8     2     16     10     10 

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: