Sorting a list of random transactions using dplyr

  • A+
Category:Languages

Assume the following set of original transactions:

library(tidyverse)  original_transactions <- data.frame(   row = 1:6,   start = 0,   change = runif(6, min = -10, max = 10) %>% round(2),   end = 0 ) %>% mutate(   temp = cumsum(change),   end = 100 + temp, # End balance   start = end - change # Start balance ) %>% select(   -temp ) 

Sorting a list of random transactions using dplyr

It shows a (chronological) sequence of transactions with a starting balance of $100.00 and an ending balance of $95.65, with six transactions/changes.

Now assume that you receive a jumbled version of this

transactions <- original_transactions %>% sample_n(   6 ) %>% mutate(   row = row_number() # Original sequence is unknown ) 

Sorting a list of random transactions using dplyr

How can I reverse-engineer the sequence in R? That is, to get the sort order of transactions to match that of original_transactions? Ideally I'd like to do this using dplyr and a sequence of pipes %>% and avoid loops.

Assume that the start/end balances will be unique and that, in general, the number of transactions can vary.

 


First, let

original_transactions #   row  start change    end # 1   1 100.00   2.33 102.33 # 2   2 102.33  -6.52  95.81 # 3   3  95.81  -4.20  91.61 # 4   4  91.61  -3.56  88.05 # 5   5  88.05   7.92  95.97 # 6   6  95.97   3.61  99.58  transactions #   row  start change    end # 1   1 100.00   2.33 102.33 # 2   2  91.61  -3.56  88.05 # 3   3  95.81  -4.20  91.61 # 4   4 102.33  -6.52  95.81 # 5   5  88.05   7.92  95.97 # 6   6  95.97   3.61  99.58 

and

diffs <- outer(transactions$start, transactions$start, `-`) matches <- abs(sweep(diffs, 2, transactions$change, `-`)) < 1e-3 

I guess that computing diffs is the most computationally expensive part in the whole solution. diffs has all possible differences between start of your transactions. Then comparing those with the change column in matches we know which pairs of rows of transactions should go together. If there were no problems regarding numeric precision, we could then use the match function and be done quickly. In this case, however, we have the following two options.


First, we may use igraph.

library(igraph) (g <- graph_from_adjacency_matrix(t(matches) * 1)) # IGRAPH 45d33f0 D--- 6 5 --  # + edges from 45d33f0: # [1] 1->4 2->5 3->2 4->3 5->6 

That is, we have a hidden path graph: 1->4->3->2->5->6 which we want to recover. It is given by the longest path from the vertex which has no incoming edges (which is 1):

transactions[as.vector(tail(all_simple_paths(g, from = which(rowSums(matches) == 0)), 1)[[1]]), ] #   row  start change    end # 1   1 100.00   2.33 102.33 # 4   4 102.33  -6.52  95.81 # 3   3  95.81  -4.20  91.61 # 2   2  91.61  -3.56  88.05 # 5   5  88.05   7.92  95.97 # 6   6  95.97   3.61  99.58 

Another option is recursive.

fun <- function(x, path = x) {   if(length(xNew <- which(matches[, x])) > 0)     fun(xNew, c(path, xNew))   else path } transactions[fun(which(rowSums(matches) == 0)), ] #   row  start change    end # 1   1 100.00   2.33 102.33 # 4   4 102.33  -6.52  95.81 # 3   3  95.81  -4.20  91.61 # 2   2  91.61  -3.56  88.05 # 5   5  88.05   7.92  95.97 # 6   6  95.97   3.61  99.58 

It uses the same unique longest path graph idea as the previous approach.


No explicit loops... And of course you may rewrite everything with %>%, but it won't be as pretty as you want; this is not really a traditional data transformation task where dplyr is best.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: