How do I reference the entire row when creating a new column in a data.table?


I have a data.table with more than 200 variables, all of them binary. I want to add a column that counts, for each row, how many values differ from a reference vector:

    # Example
    library(data.table)

    dt = data.table(
      "V1" = c(1,1,0,1,0,0,0,1,0,1,0,1,1,0,1,0),
      "V2" = c(0,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0),
      "V3" = c(0,0,0,1,1,1,1,0,1,0,1,0,1,0,1,0),
      "V4" = c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0),
      "V5" = c(1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0)
    )

    reference = c(1,1,0,1,0)

I can do that with a small for loop, such as

    distance = NULL
    for (i in 1:nrow(dt)) {
      distance[i] = sum(reference != dt[i, ])
    }

But it's kind of slow and surely not the best way to do this. I tried:

    dt[, "distance" := sum(reference != c(V1, V2, V3, V4, V5))]
    dt[, "distance" := sum(reference != .SD)]

But neither works: both return the same value for every row. A solution that doesn't require typing all the variable names would also be much better, as the real data.table has over 200 columns.

 


You can use sweep() with rowSums(), e.g.

    rowSums(sweep(dt, 2, reference) != 0)
    #[1] 2 2 2 2 4 4 3 2 4 3 2 1 3 4 1 3
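If the result should end up as a column of the data.table itself, the same idea can probably be combined with `:=` and `.SDcols`, so that no column names have to be typed. The attempts in the question return a single value because, without `by`, the `j` expression is evaluated once over whole columns rather than once per row. The following is only a minimal sketch on a smaller made-up table: `cols` is a hypothetical helper variable, and converting `.SD` with `as.matrix()` and passing `FUN = "!="` to `sweep()` are choices made here, not part of the answer above.

```r
library(data.table)

# Small made-up table (3 binary columns) to keep the sketch short
dt <- data.table(
  V1 = c(1, 1, 0, 1),
  V2 = c(0, 1, 0, 1),
  V3 = c(0, 0, 0, 1)
)
reference <- c(1, 1, 0)

# Hypothetical helper: capture the columns to compare before the new
# column exists, so `distance` itself is never part of the comparison
cols <- names(dt)

# sweep() compares every column with the matching element of `reference`;
# rowSums() then counts the mismatches in each row
dt[, distance := rowSums(sweep(as.matrix(.SD), 2, reference, FUN = "!=")),
   .SDcols = cols]

print(dt$distance)
```

With 200+ columns the only thing that changes is the length of `reference`, which has to match the number of columns listed in `cols`.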

BENCHMARK

    library(data.table)
    library(microbenchmark)

    HUGH <- function(dt) {
      dt[, I := .I]
      distance_by_I <- melt(dt, id.vars = "I")[, .(distance = sum(reference != value)), keyby = "I"]
      return(dt[distance_by_I, on = "I"])
    }

    Sotos <- function(dt) {
      return(rowSums(sweep(dt, 2, reference) != 0))
    }

    dt1 <- as.data.table(replicate(5, sample(c(0, 1), 100000, replace = TRUE)))
    microbenchmark(HUGH(dt1), Sotos(dt1))

    #Unit: milliseconds
    #       expr       min        lq      mean   median        uq       max neval cld
    #  HUGH(dt1) 112.71936 117.03380 124.05758 121.6537 128.09904 155.68470   100   b
    # Sotos(dt1)  23.66799  31.11618  33.84753  32.8598  34.02818  68.75044   100  a
