I have a sorted array of N intervals of different length. I am plotting these intervals with alternating colors blue/green.
I am trying to find a method or algorithm to "downsample" the array of intervals to produce a visually similar plot, but with less elements.
Ideally I could write some function where I can pass the target number of output intervals as an argument. The output length only has to come close to the target.
input = [ [0, 5, "blue"], [5, 6, "green"], [6, 10, "blue"], // ...etc ] output = downsample(input, 25) // [[0, 10, "blue"], ... ]
Below is a picture of what I am trying to accomplish. In this example the input has about 250 intervals, and the output about ~25 intervals. The input length can vary a lot.
Below is my original post which I deleted yesterday, because there were issues with displaying the equations and also I wasn't very confident if it really makes sense. But today, I figured that the optimisation problem that I described can be actually solved efficiently with DP (Dynamic programming).
So I did a sample C++ implementation. Here are some results:
Here is a live demo that you can play with in your browser (make sure browser support WebGL2, like Chrome or Firefox). It takes a bit to load the page.
Here is the C++ implementation: link
Hope this is useful.
Original post below:
Even-though some good algorithms have already been proposed, I would like to propose a slightly unusual approach - interpreting the task as an optimisation problem. Although, I don't know how to efficiently solve the optimisation problem (or even if it can be solved in reasonable time at all), it might be useful to someone purely as a concept.
First, without loss of generality, lets declare the blue color to be background. We will be painting N green intervals on top of it (N is the number provided to the
downsample() function in OP's description). The ith interval is defined by its starting coordinate 0 <= xi < xmax and width wi >= 0 (xmax is the maximum coordinate from the input).
Lets also define the array G(x) to be the number of green cells in the interval [0, x) in the input data. This array can easily be pre-calculated. We will use it to quickly calculate the number of green cells in arbitrary interval [x, y) - namely: G(y) - G(x).
We can now introduce the first part of the cost function for our optimisation problem:
The smaller F1 is, the better our generated intervals cover the input intervals, so we will be searching for xi, wi that minimise it. Ideally we want F1=0 which would mean that the intervals do not cover any of the background (which of course is not possible because N is less than the input intervals).
However, this function is not enough to describe the problem, because obviously we can minimise it by taking empty intervals: F1(x, 0)=0. Instead, we want to cover as much as possible from the input intervals. Lets introduce the second part of the cost function which corresponds to this requirement:
The smaller F2 is, the more input intervals are covered. Ideally we want F2=0 which would mean that we covered all of the input rectangles. However, minimising F2 competes with minimising F1.
Finally, we can state our optimisation problem: find xi, wi that minimize F=F1 + F2
How to solve this problem? Not sure. Maybe use some metaheuristic approach for global optimisation such as Simulated annealing or Differential evolution. These are typically easy to implement, especially for this simple cost function.
Best case would be to exist some kind of DP algorithm for solving it efficiently, but unlikely.