creating new variable that takes into account prior information from the earlier records

  • A+
Category:Languages

I have data as follows and I want to create new variable that takes into account the preceding information in the prior period. For example,

moviewatched<- c('Comedy', 'Horror', 'Comedy', 'Horror', 'Drama', 'Comedy', 'Drama') name<- c('john', 'john', 'john', 'john', 'john','kate','kate') time<- c('1-2018', '1-2018', '1-2018', '2-2018', '2-2018','1-2018' ,'2-2018')   df<- data.frame(moviewatched, name, time) 

Now I need to create a variable that will tell what are the new type of genre movies he/she watched in that month. For example, in the above case, John watched 2 genre types in the first month of 2018 and watched 1 new additional type in second month(as he has already watched comedy and horror in the first month).Is there any way I can create a running count of the new types that person started watching? I want to create a variable called movietypewatched that contains total of the all genre types that person watched till that month. The output expected is as follows:

     name time   movietypewatched       john 1-2018       2      john 2-2018       3      kate 1-2018       1      kate 2-2018       2 

Thanks

 


A solution using dplyr. We can remove duplicated rows based on moviewatched and name, count unique moviewatched, and then use cumsum to calcualte the running total. df2 is the final output.

library(dplyr)  df2 <- df %>%   distinct(moviewatched, name, .keep_all = TRUE) %>%   group_by(name, time) %>%   summarise(movietypewatched = n_distinct(moviewatched)) %>%   mutate(movietypewatched = cumsum(movietypewatched)) %>%   ungroup() df2 # # A tibble: 4 x 3 #   name  time   movietypewatched #   <fct> <fct>             <int> # 1 john  1-2018                2 # 2 john  2-2018                3 # 3 kate  1-2018                1 # 4 kate  2-2018                2 

And here is a data.table solution following the same logic.

library(data.table)  setDT(df) df2 <- df[!duplicated(df[, .(moviewatched, name)])][   , .(movietypewatched = uniqueN(moviewatched)), by = .(name, time)][     , movietypewatched := cumsum(movietypewatched), by = name] df2[] #    name   time movietypewatched # 1: john 1-2018                2 # 2: john 2-2018                3 # 3: kate 1-2018                1 # 4: kate 2-2018                2 

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: