Picking unordered combinations from pools with overlap


I have pools of values and I would like to generate every possible unordered combination by picking from certain pools.

For example, I wanted to pick from pool 0, pool 0, and pool 1:

>>> from itertools import product
>>> pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
>>> part = (0, 0, 1)
>>> list(product(*(pools[i] for i in part)))
[(1, 1, 2), (1, 1, 3), (1, 1, 4), (1, 2, 2), (1, 2, 3), (1, 2, 4),
 (1, 3, 2), (1, 3, 3), (1, 3, 4), (2, 1, 2), (2, 1, 3), (2, 1, 4),
 (2, 2, 2), (2, 2, 3), (2, 2, 4), (2, 3, 2), (2, 3, 3), (2, 3, 4),
 (3, 1, 2), (3, 1, 3), (3, 1, 4), (3, 2, 2), (3, 2, 3), (3, 2, 4),
 (3, 3, 2), (3, 3, 3), (3, 3, 4)]

This generates every possible combination by picking from pool 0, pool 0, and pool 1.

However, order doesn't matter to me, so many of the combinations are actually duplicates. For example, since I used a Cartesian product, both (1, 2, 4) and (2, 1, 4) are generated.

I came up with a simple method to mitigate this issue. I count how many times I want to draw from each pool, and for the members picked from a single pool, I select without ordering using combinations_with_replacement. The code looks like this:

from collections import Counter
from itertools import chain, combinations_with_replacement, product

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
part = (0, 0, 1)

# Count how many draws come from each pool index.
cnt = Counter()
for ind in part:
    cnt[ind] += 1

# Draw without order within each pool, then take the product across pools.
blocks = [combinations_with_replacement(pools[i], cnt[i]) for i in cnt]
result = [list(chain(*combo)) for combo in product(*blocks)]
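This cuts the 27 raw products down to 18 for the example above, though a quick check (a hypothetical continuation of the snippet) shows that duplicates spanning different pools still get through:

print(len(result))                              # 18 -- down from 27
print(len({tuple(sorted(r)) for r in result}))  # only 15 are truly distinct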

This reduces ordering duplicates when I happen to choose from the same pool multiple times. However, all the pools have lots of overlap, and using combinations_with_replacement on several pools merged together would generate some invalid combinations. Is there a more efficient method to generate unordered combinations?

Edit: Extra info about the inputs: the number of parts and pools is small (~5 and ~20), and, for simplicity, each element is an integer. I have already solved the actual problem, so this is purely of academic interest. Let's say there are hundreds of integers in each pool, but some pools are small and only have dozens. So some kind of union or intersection seems to be the way to go.

 


This is a difficult problem. I think your best bet in the general case is to implement a hash table where the key is a multiset and the value is your actual combination. This is similar to what @ErikWolf mentioned; however, this method avoids producing duplicates in the first place, so no filtering is required. It also returns correct results when we encounter multisets (inputs with repeated values).
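If you want to stay in Python, a minimal sketch of this idea (unique_products is just a throwaway name) keys a dict on a sorted tuple, which serves as a hashable stand-in for a multiset:

from itertools import product

def unique_products(pools, part):
    # The sorted tuple of a combination is the same for every ordering,
    # so the dict ends up holding exactly one entry per multiset.
    cartCombs = {}
    for value in product(*(pools[i] for i in part)):
        cartCombs[tuple(sorted(value))] = value
    return list(cartCombs.values())

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(len(unique_products(pools, (0, 0, 1))))  # 15 unique combinations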

There is a faster solution that I am teasing now, but saving for later. Bear with me.

As mentioned in the comments, one approach that seems viable is to combine all of the pools and simply generate combinations of this combined pool, choosing the number of pools. You would need a tool capable of generating combinations of multisets; the only one I know of that is available in Python is in the sympy library (from sympy.utilities.iterables import multiset_combinations). The problem with this is that we still produce duplicate values and, worse, we produce results that are impossible to obtain with an analogous set-and-product combo. For example, if we were to sort and combine all of the pools from the OP and apply the following:

list(multiset_combinations([1, 2, 2, 3, 3, 3, 4, 4, 5], 3))

A couple of the results would be [1 2 2] and [4 4 5] which are both impossible to obtain from [[1, 2, 3], [2, 3, 4], [3, 4, 5]].
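A brute-force sanity check (nothing more than a sketch) confirms this:

from itertools import product

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
achievable = {tuple(sorted(p)) for p in product(*pools)}
print((1, 2, 2) in achievable)  # False: the only 1 is in pool 0, and pool 2 has no 2
print((4, 4, 5) in achievable)  # False: pool 0 has neither a 4 nor a 5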

Outside of special cases, I don't see how it is possible to avoid checking every possible product. I hope I am wrong.

Algorithm Overview
The main idea is to map each vector from our product of vectors to a unique combination without having to filter out duplicates. The example given by the OP (i.e. (1, 2, 3) and (1, 3, 2)) should map to only one value (either one of them, as order doesn't matter). We note that those two vectors are identical as sets. Now, we also have situations like:

vec1 = (1, 2, 1)
vec2 = (2, 1, 1)
vec3 = (2, 2, 1)

We need vec1 and vec2 to map to the same value, whereas vec3 needs to map to its own value. Sets won't do here, as all three of these are equivalent as sets (with sets, the elements are unique, thus {a, b, b} and {a, b} are equivalent).

This is where multisets come into play. With multisets, (2, 2, 1) and (1, 2, 1) are distinct, however (1, 2, 1) and (2, 1, 1) are the same. This is good. We now have a method to generate unique keys.
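For Python readers, this distinction is easy to check with collections.Counter, which behaves like a multiset:

from collections import Counter

vec1, vec2, vec3 = (1, 2, 1), (2, 1, 1), (2, 2, 1)

print(set(vec1) == set(vec2) == set(vec3))  # True  -- sets conflate all three
print(Counter(vec1) == Counter(vec2))       # True  -- same multiset
print(Counter(vec1) == Counter(vec3))       # False -- multiplicities differ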

As I am not a Python programmer, I will proceed in C++.

We will have some issues if we try to implement everything above exactly as is. As far as I know, std::multiset<int> can't be used as the key of a std::unordered_map without supplying a custom hash function. However, it can be used with a regular std::map. A std::map isn't as performant as a hash table underneath (it is actually a red-black tree), but it still gives decent performance. Here it is:

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

void cartestionCombos(std::vector<std::vector<int> > v, bool verbose) {

    std::map<std::multiset<int>, std::vector<int> > cartCombs;

    unsigned long int len = v.size();
    unsigned long int myProd = 1;
    std::vector<unsigned long int> s(len);

    for (std::size_t j = 0; j < len; ++j) {
        myProd *= v[j].size();
        s[j] = v[j].size() - 1;
    }

    unsigned long int loopLim = myProd - 1;
    std::vector<unsigned long int> myCounter(len, 0);
    std::vector<int> value(len, 0);
    std::multiset<int> key;

    for (std::size_t j = 0; j < loopLim; ++j) {
        key.clear();

        // Build the current product tuple and its multiset key.
        for (std::size_t k = 0; k < len; ++k) {
            value[k] = v[k][myCounter[k]];
            key.insert(value[k]);
        }

        cartCombs.insert({key, value});

        // Advance the mixed-radix counter to the next product tuple.
        int test = 0;
        while (myCounter[test] == s[test]) {
            myCounter[test] = 0;
            ++test;
        }

        ++myCounter[test];
    }

    key.clear();
    // Get last possible combination
    for (std::size_t k = 0; k < len; ++k) {
        value[k] = v[k][myCounter[k]];
        key.insert(value[k]);
    }

    cartCombs.insert({key, value});

    if (verbose) {
        int count = 1;

        for (const auto &element : cartCombs) {
            std::string tempStr;

            for (std::size_t k = 0; k < len; ++k)
                tempStr += std::to_string(element.second[k]) + ' ';

            std::cout << count << " : " << tempStr << std::endl;
            ++count;
        }
    }
}

With test cases of 8 vectors of lengths from 4 to 8 filled with random integers from 1 to 15, the above algorithm runs in about 5 seconds on my computer. That's not bad considering we are looking at nearly 2.5 million total results from our product, but we can do better. But how?

The best performance is given by std::unordered_map with a key that is built in constant time. Our key above is built in logarithmic time: every insertion into the std::multiset is O(log n), and every insertion into the std::map is logarithmic as well, whereas a hash map offers average constant-time operations. So the question is, how can we overcome these hurdles?

Best Performance

We know we must abandon std::multiset. We need some sort of object with a commutative property (the order in which we add elements must not matter) that still produces unique results.

Enter the Fundamental Theorem of Arithmetic

It states that every integer greater than 1 can be represented uniquely (up to the order of the factors) as a product of prime numbers. This is sometimes called the prime decomposition.

So now, we can simply proceed as before, but instead of constructing a multiset, we map each element to a prime number and multiply the results together. This gives us a constant-time construction for our key. Here is an example showing the power of this technique on the examples we created earlier (N.B. P below is the list of prime numbers (2, 3, 5, 7, 11, etc.)):

                      Maps to                     Maps to    product
vec1 = (1, 2, 1)  -->>  P[1], P[2], P[1]  --->>  3, 5, 3  -->>   45
vec2 = (2, 1, 1)  -->>  P[2], P[1], P[1]  --->>  5, 3, 3  -->>   45
vec3 = (2, 2, 1)  -->>  P[2], P[2], P[1]  --->>  5, 5, 3  -->>   75

This is awesome!! vec1 and vec2 map to the same number, whereas vec3 gets mapped to a different value just as we wished.

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

void cartestionCombosPrimes(std::vector<std::vector<int> > v,
                            std::vector<int> primes,
                            bool verbose) {

    std::unordered_map<int64_t, std::vector<int> > cartCombs;

    unsigned long int len = v.size();
    unsigned long int myProd = 1;
    std::vector<unsigned long int> s(len);

    for (std::size_t j = 0; j < len; ++j) {
        myProd *= v[j].size();
        s[j] = v[j].size() - 1;
    }

    unsigned long int loopLim = myProd - 1;
    std::vector<unsigned long int> myCounter(len, 0);
    std::vector<int> value(len, 0);
    int64_t key;

    for (std::size_t j = 0; j < loopLim; ++j) {
        key = 1;

        // The key is a product of primes, so it is order-independent.
        for (std::size_t k = 0; k < len; ++k) {
            value[k] = v[k][myCounter[k]];
            key *= primes[value[k]];
        }

        cartCombs.insert({key, value});

        // Advance the mixed-radix counter to the next product tuple.
        int test = 0;
        while (myCounter[test] == s[test]) {
            myCounter[test] = 0;
            ++test;
        }

        ++myCounter[test];
    }

    key = 1;
    // Get last possible combination
    for (std::size_t k = 0; k < len; ++k) {
        value[k] = v[k][myCounter[k]];
        key *= primes[value[k]];
    }

    cartCombs.insert({key, value});
    std::cout << cartCombs.size() << std::endl;

    if (verbose) {
        int count = 1;

        for (const auto &element : cartCombs) {
            std::string tempStr;

            for (std::size_t k = 0; k < len; ++k)
                tempStr += std::to_string(element.second[k]) + ' ';

            std::cout << count << " : " << tempStr << std::endl;
            ++count;
        }
    }
}

On the same example as above, which generates nearly 2.5 million products, this algorithm returns the same result in less than 0.3 seconds.

There are a couple of caveats with this latter method. We must have our primes generated a priori, and if we have many vectors in our Cartesian product, the key could grow beyond the bounds of int64_t. The first issue shouldn't be difficult to overcome, as there are many resources (libraries, lookup tables, etc.) for generating prime numbers. I'm not really sure, but I've read that the latter issue shouldn't be a problem for Python, as its integers have arbitrary precision (Python integer ranges).
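To illustrate, here is a rough Python transliteration of the prime-key approach (gen_primes and prime_key_combos are names invented for this sketch). Like the C++ code, it indexes the prime list directly by element value, so it assumes small non-negative integers, which brings us to the next point:

from itertools import product

def gen_primes(n):
    # Trial division -- plenty fast for the small n needed here.
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def prime_key_combos(pools, part):
    primes = gen_primes(max(max(p) for p in pools) + 1)
    cartCombs = {}
    for value in product(*(pools[i] for i in part)):
        key = 1
        for x in value:
            key *= primes[x]  # commutative: order of multiplication is irrelevant
        cartCombs.setdefault(key, value)
    return list(cartCombs.values())

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(len(prime_key_combos(pools, (0, 0, 1))))  # 15, matching the multiset approach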

We also have to deal with the fact that our source vectors might not be nice integer vectors with small values. This can be remedied by ranking all of the elements across all vectors before you proceed. For example, given the following vectors:

vec1 = (12345.65, 5, 5432.11111)
vec2 = (2222.22, 0.000005, 5)
vec3 = (5, 0.5, 0.8)

Ranking them, we would obtain:

rank1 = (6, 3, 5)
rank2 = (4, 0, 3)
rank3 = (3, 1, 2)

And now, these can be used in place of the actual values to create your key. The only portion of the code that would change would be the for loops that build the key (and of course the rank object that would need to be created):

for (std::size_t k = 0; k < len; ++k) {
    value[k] = v[k][myCounter[k]];
    key *= primes[rank[k][myCounter[k]]];
}
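If it helps, building the rank object itself is nearly a one-liner; a Python sketch of the idea above:

from itertools import chain

vecs = [(12345.65, 5, 5432.11111), (2222.22, 0.000005, 5), (5, 0.5, 0.8)]

# Rank = position of each element among all distinct values, sorted ascending.
order = {x: i for i, x in enumerate(sorted(set(chain(*vecs))))}
ranks = [tuple(order[x] for x in vec) for vec in vecs]
print(ranks)  # [(6, 3, 5), (4, 0, 3), (3, 1, 2)]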

Edit:
As some of the commenters have pointed out, the above method disguises the fact that all products must still be generated. I should have said that the first time around. Personally, I don't see how generating them all can be avoided, given the many different ways the same combination can present itself.

Also, in case anybody is curious, here is the test case I used above:

[ 1 10 14  6],
[ 7  2  4  8  3 11 12],
[11  3 13  4 15  8  6  5],
[10  1  3  2  9  5  7],
[ 1  5 10  3  8 14],
[15  3  7 10  4  5  8  6],
[14  9 11 15],
[ 7  6 13 14 10 11  9  4]

It should return 162295 unique combinations.
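For anyone who wants to cross-check that count from Python, a brute-force sketch over the same vectors (slow, but it only has to run once):

from itertools import product

pools = [[1, 10, 14, 6], [7, 2, 4, 8, 3, 11, 12],
         [11, 3, 13, 4, 15, 8, 6, 5], [10, 1, 3, 2, 9, 5, 7],
         [1, 5, 10, 3, 8, 14], [15, 3, 7, 10, 4, 5, 8, 6],
         [14, 9, 11, 15], [7, 6, 13, 14, 10, 11, 9, 4]]

# One sorted tuple per multiset; the set keeps only the distinct ones.
print(len({tuple(sorted(t)) for t in product(*pools)}))  # 162295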
