I have data as follows and I want to create new variable that takes into account the preceding information in the prior period. For example,
moviewatched<- c('Comedy', 'Horror', 'Comedy', 'Horror', 'Drama', 'Comedy', 'Drama') name<- c('john', 'john', 'john', 'john', 'john','kate','kate') time<- c('1-2018', '1-2018', '1-2018', '2-2018', '2-2018','1-2018' ,'2-2018') df<- data.frame(moviewatched, name, time)
Now I need to create a variable that will tell what are the new type of genre movies he/she watched in that month. For example, in the above case, John watched 2 genre types in the first month of 2018 and watched 1 new additional type in second month(as he has already watched comedy and horror in the first month).Is there any way I can create a running count of the new types that person started watching? I want to create a variable called movietypewatched that contains total of the all genre types that person watched till that month. The output expected is as follows:
name time movietypewatched john 1-2018 2 john 2-2018 3 kate 1-2018 1 kate 2-2018 2
A solution using
dplyr. We can remove duplicated rows based on
name, count unique
moviewatched, and then use
cumsum to calcualte the running total.
df2 is the final output.
library(dplyr) df2 <- df %>% distinct(moviewatched, name, .keep_all = TRUE) %>% group_by(name, time) %>% summarise(movietypewatched = n_distinct(moviewatched)) %>% mutate(movietypewatched = cumsum(movietypewatched)) %>% ungroup() df2 # # A tibble: 4 x 3 # name time movietypewatched # <fct> <fct> <int> # 1 john 1-2018 2 # 2 john 2-2018 3 # 3 kate 1-2018 1 # 4 kate 2-2018 2
And here is a
data.table solution following the same logic.
library(data.table) setDT(df) df2 <- df[!duplicated(df[, .(moviewatched, name)])][ , .(movietypewatched = uniqueN(moviewatched)), by = .(name, time)][ , movietypewatched := cumsum(movietypewatched), by = name] df2 # name time movietypewatched # 1: john 1-2018 2 # 2: john 2-2018 3 # 3: kate 1-2018 1 # 4: kate 2-2018 2