Multiple functions over a list of columns and generate new column names automatically with data.table

  • A+
Category:Languages

How to adjust a data table manipulation so that, besides sum per category of several colums, it would also calculate other functions at the same time such as mean and counts (.N) and automatically create column names: "sum c1" , "sum c2", "sum c4" , "mean c1", " mean c2", "mean c4" and preferably also 1 column "counts"?

My old solution was to write out

mean col1 = .... mean col2 = .... 

Etc, Inside the data.table command

Which worked, but horribly inefficient I think, and it won't work anymore to precode it if in the new app version, the calculations depend on user choices in an R Shiny app what to calculate for which columns.

I've read my way through a bunch of posts and blog articles but haven't quite figured out how to best do this. I read that in some cases the manipulation can become quite slow on large data tables depending on what approach you use (.sdcols, get, lapply, and or by =). Therefore I added a 'sizeable' dummy data set

My real data is around 100k rows by 100 columns and 1-100 groups roughly.

library(data.table) n = 100000 dt  = data.table(index=1:100000,                  category = sample(letters[1:25], n, replace = T),                  c1=rnorm(n,10000),                  c2=rnorm(n,1000),                  c3=rnorm(n,100),                  c4 = rnorm(n,10) )  # add more columns to test for big data tables  lapply(c(paste('c', 5:100, sep ='')),        function(addcol) dt[[addcol]] <<- rnorm(n,1000) )  # Simulate columns selected by shiny app user   Colchoice <- c("c1", "c4") FunChoice <- c(".N", "mean", "sum")  # attempt which now does just one function and doesn't add names dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=Colchoice ] 

Expected output is a row per group and a column for each function per each selected column.

Category  Mean c1 Sum c1 Mean c4 ... A B C D E ...... 

Possibly a duplicate but I haven't found the exact answer that I need

 


Here's a data.table answer:

funs_list <- lapply(FunChoice, as.symbol) dcast(dt, category~1, fun=list(mean, sum), value.var = Colchoice) 

It's super fast and does what you want.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: