Getting first n unique elements from Python list

  • A+
Category:Languages

I have a python list where elements can repeat.

>>> a = [1,2,2,3,3,4,5,6] 

I want to get the first n unique elements from the list. So, in this case, if i want the first 5 unique elements, they would be:

>>> [1,2,3,4,5] 

I have come up with a solution using generators:

def iterate(itr, upper=5):      count = 0     for index, element in enumerate(itr):         if index==0:             count += 1             yield element          elif element not in itr[:index] and count<upper:             count += 1             yield element   >>> i = iterate(a, 5) >>> [e for e in i] >>> [1,2,3,4,5] 

I have doubts on this being the most optimal solution. Is there an alternative strategy that i can implement to write it in a more pythonic and efficient way?

 


I would use a set to remember what was seen and return from the generator when you have seen enough:

a = [1,2,2,3,3,4,5,6]  def get_unique_N(iterable, N):     """Yields (in order) the first N unique elements of iterable.      Might yield less if data too short."""     seen = set()     for e in iterable:         if e in seen:             continue         seen.add(e)         yield e         if len(seen) == N:             return  k = get_unique_N([1,2,2,3,3,4,5,6], 4) print(list(k)) 

Output:

[1,2,3,4] 

According to PEP-479 you should return from generators, not raise StopIteration - thanks to @khelwood & @iBug for that piece of comment - one never learns out.

With 3.6 you get a deprecated warning, with 3.7 it gives RuntimeErrors: Transition Plan if still using raise StopIteration


Your solution using elif element not in itr[:index] and count<upper: uses O(k) lookups - with k being the length of the slice - using a set reduces this to O(1) lookups but uses more memory because the set has to be kept as well. It is a speed vs. memory tradeoff - what is better is application/data dependend.

Consider [1,2,3,4,4,4,4,5] vs [1]*1000+[2]*1000+[3]*1000+[4]*1000+[5]*1000+[6]:

For 6 uniques (in longer list):

  • you would have lookups of O(1)+O(2)+...+O(5001)
  • mine would have 5001*O(1) lookup + memory for set( {1,2,3,4,5,6})

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: