- A+

Assume the following set of original transactions:

`library(tidyverse) original_transactions <- data.frame( row = 1:6, start = 0, change = runif(6, min = -10, max = 10) %>% round(2), end = 0 ) %>% mutate( temp = cumsum(change), end = 100 + temp, # End balance start = end - change # Start balance ) %>% select( -temp ) `

It shows a (chronological) sequence of transactions with a starting balance of $100.00 and an ending balance of $95.65, with six transactions/changes.

Now assume that you receive a jumbled version of this

`transactions <- original_transactions %>% sample_n( 6 ) %>% mutate( row = row_number() # Original sequence is unknown ) `

How can I reverse-engineer the sequence in R? That is, to get the sort order of `transactions`

to match that of `original_transactions`

? Ideally I'd like to do this using `dplyr`

and a sequence of pipes `%>%`

and avoid loops.

Assume that the start/end balances will be unique and that, in general, the number of transactions can vary.

First, let

`original_transactions # row start change end # 1 1 100.00 2.33 102.33 # 2 2 102.33 -6.52 95.81 # 3 3 95.81 -4.20 91.61 # 4 4 91.61 -3.56 88.05 # 5 5 88.05 7.92 95.97 # 6 6 95.97 3.61 99.58 transactions # row start change end # 1 1 100.00 2.33 102.33 # 2 2 91.61 -3.56 88.05 # 3 3 95.81 -4.20 91.61 # 4 4 102.33 -6.52 95.81 # 5 5 88.05 7.92 95.97 # 6 6 95.97 3.61 99.58 `

and

`diffs <- outer(transactions$start, transactions$start, `-`) matches <- abs(sweep(diffs, 2, transactions$change, `-`)) < 1e-3 `

I guess that computing `diffs`

is the most computationally expensive part in the whole solution. `diffs`

has all possible differences between `start`

of your `transactions`

. Then comparing those with the `change`

column in `matches`

we know which pairs of rows of `transactions`

should go together. If there were no problems regarding numeric precision, we could then use the `match`

function and be done quickly. In this case, however, we have the following **two options**.

First, we may use `igraph`

.

`library(igraph) (g <- graph_from_adjacency_matrix(t(matches) * 1)) # IGRAPH 45d33f0 D--- 6 5 -- # + edges from 45d33f0: # [1] 1->4 2->5 3->2 4->3 5->6 `

That is, we have a hidden path graph: 1->4->3->2->5->6 which we want to recover. It is given by the longest path from the vertex which has no incoming edges (which is `1`

):

`transactions[as.vector(tail(all_simple_paths(g, from = which(rowSums(matches) == 0)), 1)[[1]]), ] # row start change end # 1 1 100.00 2.33 102.33 # 4 4 102.33 -6.52 95.81 # 3 3 95.81 -4.20 91.61 # 2 2 91.61 -3.56 88.05 # 5 5 88.05 7.92 95.97 # 6 6 95.97 3.61 99.58 `

Another option is recursive.

`fun <- function(x, path = x) { if(length(xNew <- which(matches[, x])) > 0) fun(xNew, c(path, xNew)) else path } transactions[fun(which(rowSums(matches) == 0)), ] # row start change end # 1 1 100.00 2.33 102.33 # 4 4 102.33 -6.52 95.81 # 3 3 95.81 -4.20 91.61 # 2 2 91.61 -3.56 88.05 # 5 5 88.05 7.92 95.97 # 6 6 95.97 3.61 99.58 `

It uses the same unique longest path graph idea as the previous approach.

No explicit loops... And of course you may rewrite everything with `%>%`

, but it won't be as pretty as you want; this is not really a traditional data transformation task where `dplyr`

is best.