Find gene names from a list in a dataframe


I need to know whether certain genes are present in my results. To do so, I have a list with my genes' names and a dataframe with the same names:

For example:


and a dataframe:

name1          name2
gene1_0035     gene1_0042
gene56_0042    gene56_0035
gene4_0042     gene4_0035
gene2_0035     gene2_0042
gene57_0042    gene57_0035

then I did:

import pandas as pd
from Bio import SeqIO

df = pd.read_csv("dataframe_not_max.txt", sep='\t')
df = df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])
#print(df)
print(list(df.columns.values))
name1 = df.iloc[:,1]
name2 = df.iloc[:,2]

liste = []
for record in SeqIO.parse(data, "fasta"):
    liste.append(
print(liste)
print(len(liste))

count = 0
for a, b in zip(name1, name2):
    if a in liste:
        count += 1
    if b in liste:
        count += 1
print(count)

What I want to know is how many times the genes from my list appear in my dataframe. However, the IDs are not exactly the same: in the list there is no _number after the gene name, so the test "if a in liste" does not recognize the ID.

Is it possible to say something like:

if a without_number in liste:  

In the example above the result would be count = 3, because only genes 1, 2 and 4 are present in both the list and the dataframe.
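One way to express "a without the number" is to strip the trailing _number before comparing. The following is only a minimal sketch: it assumes the suffix is always an underscore followed by digits, and it uses a hard-coded, hypothetical gene list in place of the names read from the FASTA file.

```python
import pandas as pd

# Hypothetical gene list standing in for the names read from the FASTA file.
liste = ["gene1", "gene2", "gene3", "gene4", "gene5"]

# The example dataframe from the question, rebuilt by hand.
df = pd.DataFrame({
    "name1": ["gene1_0035", "gene56_0042", "gene4_0042", "gene2_0035", "gene57_0042"],
    "name2": ["gene1_0042", "gene56_0035", "gene4_0035", "gene2_0042", "gene57_0035"],
})


def strip_suffix(names: pd.Series) -> pd.Series:
    """Remove a trailing '_<digits>' from every value (assumed ID format)."""
    return names.str.replace(r"_\d+$", "", regex=True)


base1 = strip_suffix(df["name1"])
base2 = strip_suffix(df["name2"])

# A row is counted when both columns hold a gene that is in the list.
count = int((base1.isin(liste) & base2.isin(liste)).sum())
print(count)  # 3
```

Comparing the stripped names with isin is an exact match, so gene5 in the list would not accidentally match gene56 or gene57 in the dataframe.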

Here is a more complicated example to check whether your script really works on my data. Let's say I have a dataframe such as:

    cluster_name    qseqid                 sseqid                 pident_x
15  cluster_016607  EOG090X00GO_0035_0035  EOG090X00GO_0042_0035
16  cluster_016607  EOG090X00GO_0035_0035  EOG090X00GO_0042_0042
18  cluster_016607  EOG090X00GO_0035_0042  EOG090X00GO_0042_0035
19  cluster_016607  EOG090X00GO_0035_0042  EOG090X00GO_0042_0042
29  cluster_015707  EOG090X00LI_0035_0035  EOG090X00LI_0042_0042
30  cluster_015707  EOG090X00LI_0035_0035  EOG090X00LI_0042_0035
34  cluster_015707  EOG090X00LI_0042_0035  g1726.t1_0035_0042
37  cluster_015707  EOG090X00LI_0042_0042  g1726.t1_0035_0042

and a list : ["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]

Here I get 6, but I should get 2, because only 2 sequences from the list are in my data: EOG090X00LI and EOG090X00GO.

In fact, I want each sequence to be counted once as soon as it appears at all, even if it only appears in a pair such as EOG090X00LI vs seq123454.

I do not know if that is clear.

For this example I used:

df = pd.read_csv("test_busco_augus.csv", sep=',')
#df = df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])
print(df)
print(list(df.columns.values))
name1 = df.iloc[:,3]
name2 = df.iloc[:,4]

liste = ["EOG090X00LI_","EOG090X00GO_","EOG090X00BA_"]

print(liste)

#get boolean mask for each column
m1 = name1.str.contains('|'.join(liste))
m2 = name2.str.contains('|'.join(liste))

#chain masks and count Trues
a = (m1 & m2).sum()
print(a)

I think you need:

#add _ to end of values
liste = [ + '_' for record in SeqIO.parse(data, "fasta")]
#liste = ["gene1_","gene2_","gene3_","gene4_","gene5_"]

#get boolean mask for each column
m1 = df['name1'].str.contains('|'.join(liste))
m2 = df['name2'].str.contains('|'.join(liste))

#chain masks and count Trues
a = (m1 & m2).sum()
print (a)
3
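For the second example, where each sequence should be counted once however many rows it appears in, one possible sketch is to extract the matching prefix from each column and count the distinct values. It assumes the columns are named qseqid and sseqid as in the question's table and that the list entries are prefixes of the IDs. The dataframe is rebuilt by hand here purely for illustration.

```python
import re

import pandas as pd

liste = ["EOG090X00LI_", "EOG090X00GO_", "EOG090X00BA_"]

# The second example dataframe from the question, rebuilt by hand.
df = pd.DataFrame({
    "qseqid": [
        "EOG090X00GO_0035_0035", "EOG090X00GO_0035_0035",
        "EOG090X00GO_0035_0042", "EOG090X00GO_0035_0042",
        "EOG090X00LI_0035_0035", "EOG090X00LI_0035_0035",
        "EOG090X00LI_0042_0035", "EOG090X00LI_0042_0042",
    ],
    "sseqid": [
        "EOG090X00GO_0042_0035", "EOG090X00GO_0042_0042",
        "EOG090X00GO_0042_0035", "EOG090X00GO_0042_0042",
        "EOG090X00LI_0042_0042", "EOG090X00LI_0042_0035",
        "g1726.t1_0035_0042", "g1726.t1_0035_0042",
    ],
})

# Escape each prefix so it is matched literally, and anchor it at the start.
pattern = "^(" + "|".join(re.escape(p) for p in liste) + ")"

# Extract the matched prefix from both columns (NaN where nothing matches),
# pool the results, and drop the non-matching entries.
found = pd.concat([
    df["qseqid"].str.extract(pattern, expand=False),
    df["sseqid"].str.extract(pattern, expand=False),
]).dropna()

# Each sequence is counted once, however many rows it appears in.
unique_count = found.nunique()
print(unique_count)  # 2
```

Because the two columns are pooled before counting, a sequence that appears in only one column (for example paired with g1726.t1) is still counted, and EOG090X00BA_ is not counted since it never occurs.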

