Python: determine if three text strings stored in a dataframe have any words in common

  • A+
Category:Languages

Say I have the following dataframe df:

      A             B               C 0     mom;dad;son;  sister;son;     yes;no;maybe; 1     dad;          daughter;niece; no;snow; 2     son;dad;      cat;son;dad;    tree;dad;son; 3     daughter;mom; niece;          referee; 4     dad;daughter; cat;            dad; 

And you want to check if, between columns A, B, and C, there is a common word, and create a column D with 1 if there is and 0 if there isn't any. For a word to be common, it's enough for it to appear in just two of the three columns.

The outcome should be:

      A             B               C              D 0     mom;dad;son;  sister;son;     yes;no;maybe;  1 1     dad;          daughter;niece; no;snow;       0 2     son;dad;      cat;son;dad;    tree;dad;son;  1 3     daughter;mom; niece;          referee;       0 4     dad;daughter; cat;            dad;           1 

I am trying to implement this by doing:

for index, row in df.iterrows():      w1=row['A'].split(';')     w2=row['B'].split(';')     w3=row['C'].split(';')      if len(set(w1).intersection(w2))>0 or len(set(w1).intersection(w3))>0 or len(set(w2).intersection(w3))>0:         df['D'][index]==1     else:         df['D'][index]==0 

However, the resulting D column only bears 0 because (possibly) I am not comparing each individual word in w1 to the others in w2 and w3. How could I achieve this?


Use stack + pandas.Series.str.get_dummies

df.assign(     D=df.stack().str.get_dummies(';').sum(level=0).gt(1).any(1).astype(int) )                 A                B              C  D 0   mom;dad;son;      sister;son;  yes;no;maybe;  1 1           dad;  daughter;niece;       no;snow;  0 2       son;dad;     cat;son;dad;  tree;dad;son;  1 3  daughter;mom;           niece;       referee;  0 4  dad;daughter;             cat;           dad;  1 

Details

Notice that when we stack and get dummies, the interim result looks like this:

     cat  dad  daughter  maybe  mom  niece  no  referee  sister  snow  son  tree  yes 0 A    0    1         0      0    1      0   0        0       0     0    1     0    0   B    0    0         0      0    0      0   0        0       1     0    1     0    0   C    0    0         0      1    0      0   1        0       0     0    0     0    1 1 A    0    1         0      0    0      0   0        0       0     0    0     0    0   B    0    0         1      0    0      1   0        0       0     0    0     0    0   C    0    0         0      0    0      0   1        0       0     1    0     0    0 2 A    0    1         0      0    0      0   0        0       0     0    1     0    0   B    1    1         0      0    0      0   0        0       0     0    1     0    0   C    0    1         0      0    0      0   0        0       0     0    1     1    0 3 A    0    0         1      0    1      0   0        0       0     0    0     0    0   B    0    0         0      0    0      1   0        0       0     0    0     0    0   C    0    0         0      0    0      0   0        1       0     0    0     0    0 4 A    0    1         1      0    0      0   0        0       0     0    0     0    0   B    1    0         0      0    0      0   0        0       0     0    0     0    0   C    0    1         0      0    0      0   0        0       0     0    0     0    0 

Where the prior columns are embedded in the second level of the index. So I want to sum over the first level in order to see how many times that word appears.

That summation interim looks like:

   cat  dad  daughter  maybe  mom  niece  no  referee  sister  snow  son  tree  yes 0    0    1         0      1    1      0   1        0       1     0    2     0    1 1    0    1         1      0    0      1   1        0       0     1    0     0    0 2    1    3         0      0    0      0   0        0       0     0    3     1    0 3    0    0         1      0    1      1   0        1       0     0    0     0    0 4    1    2         1      0    0      0   0        0       0     0    0     0    0 

Notice that we catch 'son' in row 1, 'dad' and 'son' in row 3 and so on.

If it appears in more than 1 column (hence gt(1)) then I want to count it as a 1 (hence any(1).astype(int)).

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: