# Sorting a list of random transactions using dplyr

• A+
Category：Languages

Assume the following set of original transactions:

``library(tidyverse)  original_transactions <- data.frame(   row = 1:6,   start = 0,   change = runif(6, min = -10, max = 10) %>% round(2),   end = 0 ) %>% mutate(   temp = cumsum(change),   end = 100 + temp, # End balance   start = end - change # Start balance ) %>% select(   -temp ) `` It shows a (chronological) sequence of transactions with a starting balance of \$100.00 and an ending balance of \$95.65, with six transactions/changes.

Now assume that you receive a jumbled version of this

``transactions <- original_transactions %>% sample_n(   6 ) %>% mutate(   row = row_number() # Original sequence is unknown ) `` How can I reverse-engineer the sequence in R? That is, to get the sort order of `transactions` to match that of `original_transactions`? Ideally I'd like to do this using `dplyr` and a sequence of pipes `%>%` and avoid loops.

Assume that the start/end balances will be unique and that, in general, the number of transactions can vary.

First, let

``original_transactions #   row  start change    end # 1   1 100.00   2.33 102.33 # 2   2 102.33  -6.52  95.81 # 3   3  95.81  -4.20  91.61 # 4   4  91.61  -3.56  88.05 # 5   5  88.05   7.92  95.97 # 6   6  95.97   3.61  99.58  transactions #   row  start change    end # 1   1 100.00   2.33 102.33 # 2   2  91.61  -3.56  88.05 # 3   3  95.81  -4.20  91.61 # 4   4 102.33  -6.52  95.81 # 5   5  88.05   7.92  95.97 # 6   6  95.97   3.61  99.58 ``

and

``diffs <- outer(transactions\$start, transactions\$start, `-`) matches <- abs(sweep(diffs, 2, transactions\$change, `-`)) < 1e-3 ``

I guess that computing `diffs` is the most computationally expensive part in the whole solution. `diffs` has all possible differences between `start` of your `transactions`. Then comparing those with the `change` column in `matches` we know which pairs of rows of `transactions` should go together. If there were no problems regarding numeric precision, we could then use the `match` function and be done quickly. In this case, however, we have the following two options.

First, we may use `igraph`.

``library(igraph) (g <- graph_from_adjacency_matrix(t(matches) * 1)) # IGRAPH 45d33f0 D--- 6 5 --  # + edges from 45d33f0: #  1->4 2->5 3->2 4->3 5->6 ``

That is, we have a hidden path graph: 1->4->3->2->5->6 which we want to recover. It is given by the longest path from the vertex which has no incoming edges (which is `1`):

``transactions[as.vector(tail(all_simple_paths(g, from = which(rowSums(matches) == 0)), 1)[]), ] #   row  start change    end # 1   1 100.00   2.33 102.33 # 4   4 102.33  -6.52  95.81 # 3   3  95.81  -4.20  91.61 # 2   2  91.61  -3.56  88.05 # 5   5  88.05   7.92  95.97 # 6   6  95.97   3.61  99.58 ``

Another option is recursive.

``fun <- function(x, path = x) {   if(length(xNew <- which(matches[, x])) > 0)     fun(xNew, c(path, xNew))   else path } transactions[fun(which(rowSums(matches) == 0)), ] #   row  start change    end # 1   1 100.00   2.33 102.33 # 4   4 102.33  -6.52  95.81 # 3   3  95.81  -4.20  91.61 # 2   2  91.61  -3.56  88.05 # 5   5  88.05   7.92  95.97 # 6   6  95.97   3.61  99.58 ``

It uses the same unique longest path graph idea as the previous approach.

No explicit loops... And of course you may rewrite everything with `%>%`, but it won't be as pretty as you want; this is not really a traditional data transformation task where `dplyr` is best.