Creating a Python generator that yields ordered products of integers from two large lists

  • A+

So, I have two very large lists of numbers l1 and l2. I'd like to multiply each element of l1 with each element of l2 without explictly creating a new list of products. Thus, I'd like a generator. This part is easy. I can just do something like

for a in l1:     for b in l2:         yield a * b 

However, I also need these products to be ordered by their magnitude. I am wondering if there is some clever trick to order the yield statements so that this can also be done with a generator. In Python 3, if possible. Thanks.


I'll call the lists xs and ys, and assume they're sorted. As you noted in a comment, the smallest product is necessarily xs[0] * ys[0] - but only if you're also assuming that all numbers are non-negative, so I'll assume that too.

After the first product, it gets messier - else you already would have solved it ;-) The next two to consider are xs[0] * ys[1] and xs[1] * ys[0]. Easy enough, but then the next to consider depends on which of those won. If xs[0] * ys[1] won, you only need to replace it with xs[0] * ys[2], but if xs[1] * ys[0] won then both xs[1] * ys[1] and xs[2] * ys[0] come into play. And so on.

The following keeps track of the growing set of possibilities with a heap. The heap never holds more than len(xs) items, so the code first arranges to make xs the shorter list:

def upprod(xs, ys):     # xs and ys must be sorted, and non-negative     from heapq import heappush, heappop     # make xs the shorter     if len(ys) < len(xs):         xs, ys = ys, xs     if not xs:         return     lenxs = len(xs)     lenys = len(ys)     # the heap holds 4-tuples:     #     (product, xs index, ys index, xs[xs index])     h = [(xs[0] * ys[0], 0, 0, xs[0])]     while h:         prod, xi, yi, x = heappop(h)         yield prod         # same x with next y         yi += 1         if yi < lenys:             heappush(h, (x * ys[yi], xi, yi, x))         # if this is the first time we used x, start         # the next x going         if yi == 1:             xi += 1             if xi < lenxs:                 x = xs[xi]                 heappush(h, (x * ys[0], xi, 0, x)) 

I'd be pleasantly surprised if an essentially more efficient solution existed. If someone thinks they have one, please try it first using this randomized tester:

from itertools import product from random import randrange MAXLEN = 10 UB = 1000 ntest = 0 while True:     ntest += 1     lenxs = randrange(MAXLEN + 1)     lenys = randrange(MAXLEN + 1)     xs = sorted(randrange(UB) for i in range(lenxs))     ys = sorted(randrange(UB) for i in range(lenys))     brute = sorted(a*b for a, b in product(xs, ys))     got = list(upprod(xs, ys))     if brute != got:         print("OUCH!")         print(xs)         print(ys)         print(brute)         print(got)         assert False     if ntest % 10_000 == 0:         print(f"finished test {ntest:,}") 


The above doesn't fully exploit the partial ordering we can deduce from indices alone: if

i1 <= i2 and j1 <= j2 

then we know

xs[i1] * ys[j1] <= xs[i2] * ys[j2] 

because sorting implies xs[i1] <= xs[i2] and ys[j1] <= ys[j2].

So, for example, if the index pairs (0, 1) and (1, 0) are on the heap, and the second one wins, (2, 0) needs to be added to the heap but (1, 1) doesn't: from the indices alone, we know that the product from the (0, 1) remaining in the heap is no larger than the product from (1, 1). It's only when (0, 1) is also removed that (1, 1) needs to be added.

In general, each pair of the form (i, 0) has a single immediate predecessor (i-1, 0), and (0, j) the single (0, j-1), and all other (i, j) have two immediate predecessors: (i-1, j) and (i, j-1). There's no need to put a pair on the heap until all its predecessors have been taken off the heap.

Which leads to this code, which is seemingly "more elegant" because more symmetric:

def upprod(xs, ys):     # xs and ys must be sorted, and non-negative     from heapq import heappush, heappop     # make xs the shorter     if len(ys) < len(xs):         xs, ys = ys, xs     if not xs:         return     lenxs = len(xs)     lenys = len(ys)     # the heap holds 3-tuples:     #     (product, xs index, ys index)     h = [(xs[0] * ys[0], 0, 0)]      # interior points for which only one immediate predecessor has     # been processed; there's no need to put them in the heap     # until their second predecessor has been processed too     pending = set()      def add(xi, yi):         if xi < lenxs and yi < lenys:             if xi and yi: # if either is 0, only one predecessor                 p = xi, yi                 if p in pending:                     pending.remove(p)                 else:                     pending.add(p)                     return             heappush(h, (xs[xi] * ys[yi], xi, yi))      while h:         prod, xi, yi = heappop(h)         yield prod         # same x with next y; and same y with next x         add(xi, yi + 1)         add(xi + 1, yi)     assert not pending 

Compared to the first code, it keeps the heap somewhat smaller in many cases. But heap operations take time logarithmic in the number of heap entries, and the heap can still grow to len(xs) entries, so that's not much of a win. It's probably lost to the overhead of the two new function calls (while inlining those is too ugly to bear).


:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: