Calculating p-values within a huge data frame takes very long


I am trying to calculate p-values with a Student's t-test within a very large data frame in long format. Since my original data frame has a very large number of rows, calculating the p-values takes very long (about 100 minutes).

I am trying to speed the process up, but I am not sure whether a data frame is the best structure for this, or whether I should reshape the data and perhaps use a matrix.

Here is a reproducible example with a small data frame and a benchmark at the end.

library(dplyr)

my.t.test <- function (x, y = NULL) {
  nx <- length(x)
  mx <- mean(x)
  vx <- var(x)
  ny <- length(y)
  my <- mean(y)
  vy <- var(y)
  stderrx <- sqrt(vx/nx)
  stderry <- sqrt(vy/ny)
  stderr <- sqrt(stderrx^2 + stderry^2)
  df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 1))
  tstat <- (mx - my - 0)/stderr
  pval <- 2 * pt(-abs(tstat), df)
  return(pval)
}

cont <- c("A", "B")
set.seed(1)
df1 <- data.frame(id=rep(1:1000, each=8),
                  replicate=1:4,
                  A=rnorm(8000, mean=26, sd=5),
                  B=rnorm(8000, mean=25, sd=7))

completeDF <- function() {
  df1 %>%
    group_by(id) %>%
    summarise(Comparison=paste(cont, collapse=' - '),
              p.value=t.test(get(cont[1]), get(cont[2]))$p.value,
              log10.p.value=-log10(p.value),
              log2.foldchange=mean(get(cont[1]), na.rm=TRUE) - mean(get(cont[2]), na.rm=TRUE)
    )}

noPvalue <- function() {
  df1 %>%
    group_by(id) %>%
    summarise(Comparison=paste(cont, collapse=' - '),
              log2.foldchange=mean(get(cont[1]), na.rm=TRUE) - mean(get(cont[2]), na.rm=TRUE)
    )}

myPvalue <- function() {
  df1 %>%
    group_by(id) %>%
    summarise(Comparison=paste(cont, collapse=' - '),
              p.value=my.t.test(get(cont[1]), get(cont[2])),
              log10.p.value=-log10(p.value),
              log2.foldchange=mean(get(cont[1]), na.rm=TRUE) - mean(get(cont[2]), na.rm=TRUE)
    )}

microbenchmark::microbenchmark(
  completeDF(), noPvalue(), myPvalue()
)

My benchmark:

Unit: milliseconds
         expr       min        lq      mean    median        uq      max neval
 completeDF() 358.38330 365.09423 424.60255 369.20453 377.40354 655.2009   100
   noPvalue()  57.42996  58.89978  81.86222  59.66851  60.96582 337.2346   100
   myPvalue() 216.04812 220.98277 318.09568 224.19516 493.74908 609.4516   100

So with my very reduced t.test function (no input checks, etc.), I already save some time. But I am wondering whether this can be improved further, perhaps by vectorising somehow.

 


First of all, replace mean(x) with sum(x) / length(x), as mean is slow.
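For instance, a quick check along these lines (the test vector and its size are arbitrary choices of mine, and timings are machine-dependent) shows the gap:

# compare mean() with the manual sum()/length() on an arbitrary test vector
library(microbenchmark)
x0 <- runif(1e+7)
microbenchmark(mean(x0), sum(x0) / length(x0))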

Then when I profile the updated my.t.test, I find that 80% of its execution time is spent in var. So I replace var with an Rcpp implementation.
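If you want to reproduce that profiling, a minimal sketch with Rprof could look like this (the workload below is my own choice, and the roughly 80% share will vary with data size and machine):

# profile repeated calls of my.t.test on two fixed vectors (sizes are arbitrary)
xx <- rnorm(1e+5); yy <- rnorm(1e+5)
Rprof(tf <- tempfile())
for (i in 1:500) my.t.test(xx, yy)
Rprof(NULL)
head(summaryRprof(tf)$by.total)   # see how much total time is attributed to "var"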

library(Rcpp)

cppFunction("double var_cpp (NumericVector x, double xc) {
  size_t n = (size_t)x.size();
  double z1 = 0.0, z2 = 0.0, *p = &x[0], *q = &x[n];
  // if the length is odd, take the first squared deviation on its own
  if (n & 1) {z1 = (*p - xc) * (*p - xc); p++;}
  // 2-fold unrolled accumulation of squared deviations from the mean xc
  for (; p < q; p += 2) {
    z1 += (p[0] - xc) * (p[0] - xc);
    z2 += (p[1] - xc) * (p[1] - xc);
    }
  z1 = (z1 + z2) / (double)(n - 1);
  return z1;
  }")

library(microbenchmark)
x <- runif(1e+7)
xc <- sum(x) / length(x)
microbenchmark(var_cpp(x, xc), var(x))
#Unit: milliseconds
#           expr       min        lq      mean    median        uq       max
# var_cpp(x, xc)  20.71985  20.76298  21.00832  20.80576  20.87323  25.85723
#         var(x) 109.61120 109.78513 111.92657 109.89077 114.21301 121.98907
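Before swapping it in, I would check that var_cpp agrees with base var up to floating-point error (note that it assumes a complete numeric vector of length at least 2 and does no NA handling):

# agreement check for var_cpp on a few odd and even lengths
set.seed(2)
for (n in c(2, 7, 8, 1001)) {
  z <- rnorm(n)
  stopifnot(isTRUE(all.equal(var_cpp(z, sum(z) / n), var(z))))
}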

sum can be boosted as well.

cppFunction("double sum_cpp (NumericVector x) {   size_t n = (size_t)x.size();   double z1 = 0.0, z2 = 0.0, *p = &x[0], *q = &x[n];   if (n & 2) z1 = *p++;   for (; p < q; p += 2) {z1 += p[0]; z2 += p[1];}   z1 = (z1 + z2);   return z1;   }")  microbenchmark(sum_cpp(x), sum(x)) #Unit: milliseconds #       expr      min       lq     mean   median       uq      max neval # sum_cpp(x) 15.58856 15.63613 15.70195 15.67847 15.69998 18.14852   100 #     sum(x) 30.13504 30.20687 30.23993 30.23877 30.26721 30.40525   100 

So these give:

# Welch two-sample t-test p-value, without the input checks of stats::t.test
my.t.test.cpp <- function (x, y = NULL) {
  nx <- length(x)
  mx <- sum_cpp(x) / nx
  vx <- var_cpp(x, mx)
  ny <- length(y)
  my <- sum_cpp(y) / ny
  vy <- var_cpp(y, my)
  stderrx <- sqrt(vx/nx)
  stderry <- sqrt(vy/ny)
  stderr <- sqrt(stderrx^2 + stderry^2)
  df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 1))
  tstat <- (mx - my - 0)/stderr
  pval <- 2 * pt(-abs(tstat), df)
  return(pval)
}
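To see the end-to-end effect, the Rcpp-backed function can be dropped into the same summarise() call as in myPvalue(). A sketch, reusing df1 and cont from the question (the wrapper name myPvalueCpp is mine; timings will vary):

myPvalueCpp <- function() {
  df1 %>%
    group_by(id) %>%
    summarise(Comparison=paste(cont, collapse=' - '),
              p.value=my.t.test.cpp(get(cont[1]), get(cont[2])),
              log10.p.value=-log10(p.value),
              log2.foldchange=mean(get(cont[1]), na.rm=TRUE) - mean(get(cont[2]), na.rm=TRUE)
    )}

microbenchmark::microbenchmark(myPvalue(), myPvalueCpp())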

On Martin Morgan's answer

Thanks, Martin, for converting the dplyr code to base R. Now I can see more clearly what the OP is doing.

Thanks also for adding fcpp in your revision. I had written a very similar fcpp myself. My benchmarking with datasets of different sizes shows that fcpp and f2.2 have about the same performance (consistent with his benchmarking).

But we are both bottlenecked by the factor() call at the beginning. For the OP's data df1, where the grouping variable id is coded 1:1000, we can instead do class(id) <- "factor"; levels(id) <- 1:1000 (see the sketch below). In general we might use as.factor, which helps when the grouping variable is already a factor in the data frame. See R: Why use as.factor() instead of just factor().
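A sketch of that idea on the OP's df1 (the names f1, f2 and z are mine; whether it pays off depends on the size of the data):

id <- df1$id
f1 <- factor(id)                      # the usual way: finds, sorts and matches the unique values
f2 <- id
class(f2) <- "factor"                 # reuse the existing 1..1000 integer coding as the codes
levels(f2) <- as.character(1:1000)
identical(f1, f2)                     # should be TRUE for this particular coding

microbenchmark::microbenchmark(
  factor(id),
  {z <- id; class(z) <- "factor"; levels(z) <- as.character(1:1000)}
)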
