Remove item from list based on the next item in same list

  • A+
Category:Languages

I just started learning python and here I have a sorted list of protein sequences (total 59,000 sequences) and some of them overlap. I have made a toy list here for example:

ABCDE ABCDEFG ABCDEFGH ABCDEFGHIJKLMNO CEST DBTSFDE DBTSFDEO EOEUDNBNUW EOEUDNBNUWD EAEUDNBNUW FEOEUDNBNUW FG FGH 

I would like to remove those shorter overlap and just keep the longest one so the desired output would look like this:

ABCDEFGHIJKLMNO CEST DBTSFDEO EAEUDNBNUW FEOEUDNBNUWD FGH 

How can I do it? My code looks like this:

with open('toy.txt' ,'r') as f:     pattern = f.read().splitlines()     print pattern      for i in range(0, len(pattern)):         if pattern[i] in pattern[i+1]:             pattern.remove(pattern[i])         print pattern 

And i got the error message:

['ABCDE', 'ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH'] ['ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH'] ['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH'] ['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH'] ['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH'] ['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH'] ['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH'] ['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH'] Traceback (most recent call last):   File "test.py", line 8, in <module>     if pattern[i] in pattern[i+1]: IndexError: list index out of range 

Thank you very much!

EDIT: Sorry all, I forgot to mention that some string could be this:

EOEUDNBNUW

EOEUDNBNUWD

FEOEUDNBNUW

so I think the .startswith won't work in this case.

 


You could use groupby() and max() to help here:

from itertools import groupby  with open('toy.txt') as f_input:     for key, group in groupby(f_input, lambda x: x[:2]):         print(max(group, key=lambda x: len(x)).strip()) 

This would display:

ABCDEFGHIJKLMNO CEST DBTSFDEO EOEUDNBNUW EAEUDNBNUW FGH 

groupby() works by returning a list of matching items based on a function, in this case consecutive lines with the same first 2 characters. The max() function then takes this list and returns the list item with the longest length.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: