Incremental sequences with interruptions


I have a dataset with repeating runs of TRUE that I would like to label, within each id, with an incrementing sequence number. A FALSE ends a run of TRUEs, and the first FALSE that ends a given run should get the same label as that run. Any further consecutive FALSEs between runs of TRUE are irrelevant and are labeled 0.

For example:

    > test
       id logical sequence
    1   1    TRUE        1
    2   1    TRUE        1
    3   1   FALSE        1
    4   1    TRUE        2
    5   1    TRUE        2
    6   1   FALSE        2
    7   1    TRUE        3
    8   2    TRUE        1
    9   2    TRUE        1
    10  2    TRUE        1
    11  2   FALSE        1
    12  2    TRUE        2
    13  2    TRUE        2
    14  2    TRUE        2
    15  3   FALSE        0
    16  3   FALSE        0
    17  3   FALSE        0
    18  3    TRUE        1
    19  3   FALSE        1
    20  3    TRUE        2
    21  3   FALSE        2
    22  3   FALSE        0
    23  3   FALSE        0
    24  3   FALSE        0
    25  3    TRUE        3

And so on. I have considered using rle(), which produces:

    > rle(test$logical)
    Run Length Encoding
      lengths: int [1:13] 2 1 2 1 4 1 3 3 1 1 ...
      values : logi [1:13] TRUE FALSE TRUE FALSE TRUE FALSE ...

But I am not sure how to map this back on the data frame. Any suggestions on how to approach this problem?

Here are the sample data:

    > dput(test)
    structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
    2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), logical = c(TRUE, TRUE,
    FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE,
    TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE,
    FALSE, FALSE, TRUE)), .Names = c("id", "logical"), class = "data.frame", row.names = c(NA,
    -25L))


A pure data.table solution:

    # load the 'data.table' package & convert 'test' to a data.table with 'setDT'
    library(data.table)
    setDT(test)

    # calculate the new sequence
    test[, new_seq := (rleid(logical) - !logical) * !(!logical & !shift(logical, fill = FALSE)), by = id
         ][new_seq != 0, new_seq := rleid(new_seq), by = id][]

which gives:

        id logical new_seq
     1:  1    TRUE       1
     2:  1    TRUE       1
     3:  1   FALSE       1
     4:  1    TRUE       2
     5:  1    TRUE       2
     6:  1   FALSE       2
     7:  1    TRUE       3
     8:  2    TRUE       1
     9:  2    TRUE       1
    10:  2    TRUE       1
    11:  2   FALSE       1
    12:  2    TRUE       2
    13:  2    TRUE       2
    14:  2    TRUE       2
    15:  3   FALSE       0
    16:  3   FALSE       0
    17:  3   FALSE       0
    18:  3    TRUE       1
    19:  3   FALSE       1
    20:  3    TRUE       2
    21:  3   FALSE       2
    22:  3   FALSE       0
    23:  3   FALSE       0
    24:  3   FALSE       0
    25:  3    TRUE       3

What this does:

  • rleid(logical) - !logical creates a numeric run-length id and subtracts 1 wherever logical is FALSE, so each run of FALSEs takes the id of the run of TRUEs before it.
  • The result of the previous step is then multiplied by the result of !(!logical & !shift(logical, fill = FALSE)), a TRUE/FALSE vector that is FALSE for every FALSE not directly preceded by a TRUE within the id (that is, consecutive FALSE values other than the first FALSE after a TRUE). Those rows become 0.
  • Finally, we create a new run-length id for only the rows where new_seq is not equal to 0, which gives your desired result.
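To make the steps above easier to follow, here is a minimal sketch that splits the one-liner into intermediate columns for the id 3 rows of the sample data. The column names step1, keep and step3 are made up for illustration and are not part of the original answer; since only one id is used, the by = id grouping is left out.

```r
library(data.table)

# Rows for id 3 from the sample data in the question
x <- data.table(logical = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
                            FALSE, FALSE, FALSE, FALSE, TRUE))

# Step 1: run-length id, minus 1 on FALSE rows so that a FALSE run
# takes the id of the TRUE run before it
x[, step1 := rleid(logical) - !logical]             # 0 0 0 2 2 4 4 4 4 4 6

# Step 2: TRUE for TRUE rows and for a FALSE that directly follows a TRUE
x[, keep := !(!logical & !shift(logical, fill = FALSE))]

# Step 3: multiplying by the mask sets all other FALSE rows to 0
x[, step3 := step1 * keep]                          # 0 0 0 2 2 4 4 0 0 0 6

# Step 4: renumber the non-zero labels as 1, 2, 3, ...
x[, new_seq := step3]
x[new_seq != 0, new_seq := rleid(new_seq)]          # 0 0 0 1 1 2 2 0 0 0 3

print(x)
```

The final new_seq column matches rows 15 to 25 of the output shown above.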

A slightly improved alternative (as suggested by @jogo in the comments):

    test[, new_seq := (rleid(logical) - !logical) * (logical | shift(logical, fill = FALSE)), by = id
         ][new_seq != 0, new_seq := rleid(new_seq), by = id][]
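The alternative mask is the same as the original one by De Morgan's law: !(!a & !b) equals a | b. A quick check in base R on the id 3 values, as a sketch: prev is a hypothetical stand-in for shift(logical, fill = FALSE), and the names mask_original and mask_alt are made up for illustration.

```r
# logical values for id 3 from the sample data in the question
logical <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
             FALSE, FALSE, FALSE, FALSE, TRUE)

# base R stand-in for shift(logical, fill = FALSE):
# the previous value, with FALSE for the first row
prev <- c(FALSE, head(logical, -1))

mask_original <- !(!logical & !prev)   # mask from the first solution
mask_alt      <- logical | prev        # mask from the alternative

identical(mask_original, mask_alt)     # TRUE
```

Both versions therefore zero out exactly the same rows; the alternative just needs fewer negations.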
