Groupby on pandas dataframe and concatenate strings with comma based on the frequency of values in a column

  • A+
Category:Languages

I have a Pandas DataFrame with the following structure.

    mytable = pd.DataFrame({'user_id': [ '3c', '3c', '3c', '3c','3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c'],                   'meet_id': [1,1,1,1,1,1,1,2,2,2,2,2,2], 'text': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc',             'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'], 'label': ['A', 'A', 'A', 'A', 'A','B', 'B', 'B', 'B', 'B',         'C', 'C', 'A']})    mytable =  mytable[['user_id', 'meet_id', 'text', 'label']] # ordering columns in the way I would like to be printed out.      user_id  meet_id text label    3c        1        abc     A    3c        1        abc     A    3c        1        abc     A    3c        1        abc     A    3c        1        abc     A    3c        1        abc     B    3c        1        abc     B    3c        2        xyz     B    3c        2        xyz     B    3c        2        xyz     B    3c        2        xyz     C    3c        2        xyz     C    3c        2        xyz     A 

I would like to groupby on the meet_id column and concatenate the label column in such a way that the label with higher frequency for that group is left untouched, while the second most frequent label will have the first label concatenated, and the last label will have all labels concatenated.

the following mocked up DataFrame is the structure I am looking for

    mytable_pro = pd.DataFrame({'user_id': ['3c', '3c', '3c', '3c','3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c'],               'meet_id': [1,1,1,1,1,1,1,2,2,2,2,2,2], 'text': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc',         'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'], 'label': ['A', 'A', 'A', 'A', 'A','B, A', 'B,A', 'B', 'B', 'B',     'B, C', 'B, C', 'A,B,C']})     mytable_pro = mytable_pro[['user_id', 'meet_id', 'text', 'label']] # ordering columns in the way I would like to be printed out. 

I tried the following but the output flattens down everything to a single row which does not work.

    flat_df = mytable.groupby(['user_id', 'meet_id', 'text'])['label']./ apply(lambda x: ', '.join(set(x.astype(str)))).reset_index() 

This gives me the following, not really what I want

    user_id  meet_id  text    label      3c        1      abc      A, B      3c        2      xyz      A,C, B 

I would like this instead

    user_id  meet_id  text  label       3c        1     abc      A       3c        1     abc      A       3c        1     abc      A       3c        1     abc      A       3c        1     abc      A       3c        1     abc    B, A       3c        1     abc    B,A       3c        2     xyz     B       3c        2     xyz     B       3c        2     xyz     B       3c        2     xyz    B, C       3c        2     xyz    B, C       3c        2     xyz    A,B,C 

Please help. Thanks a ton! I am open to other approaches as well, maybe itertools, anything.


With transform and a custom cumulative unique function

from collections import Counter  def cum_unique(x):     return pd.Series(list(map(         Counter, x     ))).cumsum().str.join(', ')  mytable.assign(label=mytable.groupby('meet_id').label.transform(cum_unique))     user_id  meet_id text    label 0       3c        1  abc        A 1       3c        1  abc        A 2       3c        1  abc        A 3       3c        1  abc        A 4       3c        1  abc        A 5       3c        1  abc     A, B 6       3c        1  abc     A, B 7       3c        2  xyz        B 8       3c        2  xyz        B 9       3c        2  xyz        B 10      3c        2  xyz     B, C 11      3c        2  xyz     B, C 12      3c        2  xyz  B, C, A 

Shortened version

mytable.assign(label=mytable.groupby('meet_id').label.transform(     lambda x: pd.Series(list(map(Counter, x))).cumsum().str.join(', ') )) 

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: