The pandas equivalent of 'if' 'else' conditionals to add a calculated column to a DataFrame

The table below counts the occurrences of each unique word in a text (the German text of Hamlet in this case).

Using pandas, I would like to add a 'frequency' column that holds one of three values.

  • If the value in the 'count' column is 1, the frequency is 'unique'

  • If the value in the 'count' column is greater than 1 but <=10, the frequency is 'infrequent'

  • If the value in the 'count' column is >10, the frequency is 'frequent'

I am new to pandas, so I initially thought I would have to use a 'for' loop with 'if' 'else'. That didn't work for me, and after reading around I see you can just use .loc[] instead. It's much cleaner.

I'll put the answer below in case anyone else needs this set out really clearly. 🙂 Here's the table I'm working with before the change:

          count                 word  length
    0     67223                            0
    1         7               deinen       6
    2         1          überwachsen      11
    3         3                 them       4
    4         2            fortunens       9
    5         1              flammen       7
    6         1    ersäuentsezlichen      17
    7         2              alleino       7
    8         1             empfehle       8
    9         1  beschulöffentlicher      19
    10        1         unterthänige      12
    11        1                   pr       2
    12        1       zurükzutreiben      14
    13       38                   wo       2
    14        1          schadhaften      11
    15        1               ddiese       6
    16        1         zurükhaltend      12
    17        1                 laim       4
    18        1               agents       6
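The .loc[] approach mentioned above is not spelled out in the post, so here is a minimal sketch of what it might look like. It assumes a DataFrame named df with a 'count' column and uses only a handful of rows from the table. The order of the assignments is a choice made for this sketch: a default is set first, and the more specific conditions then overwrite it.

```python
import pandas as pd

# A few rows from the table above; the full data has many more.
df = pd.DataFrame({
    "count": [67223, 7, 1, 3, 38],
    "word": ["", "deinen", "überwachsen", "them", "wo"],
})

# Start from a default label, then overwrite the rows matching each condition.
df["frequency"] = "infrequent"
df.loc[df["count"] > 10, "frequency"] = "frequent"
df.loc[df["count"] == 1, "frequency"] = "unique"

print(df)
```

Each .loc[] call takes a boolean mask for the rows and a column label, so no explicit loop is needed.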

 


This is a fantastic use case for pd.cut:

    import numpy as np
    import pandas as pd

    # Bins are right-inclusive: (-inf, 1], (1, 10], (10, inf]
    pd.cut(df['count'],
           bins=[-np.inf, 1, 10, np.inf],
           labels=['unique', 'infrequent', 'frequent'])

    0       frequent
    1     infrequent
    2         unique
    3     infrequent
    4     infrequent
    5         unique
    6         unique
    7     infrequent
    8         unique
    9         unique
    10        unique
    11        unique
    12        unique
    13      frequent
    14        unique
    15        unique
    16        unique
    17        unique
    18        unique
    Name: count, dtype: category
    Categories (3, object): [unique < infrequent < frequent]
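Since the question asks for a new column, the result of pd.cut can be assigned straight back to the DataFrame. The following is a small sketch under the assumption that df has a 'count' column, using a few rows from the question's table rather than the full data.

```python
import numpy as np
import pandas as pd

# A few counts from the question's table.
df = pd.DataFrame({"count": [67223, 7, 1, 3, 38]})

# pd.cut uses right-inclusive bins by default, so the edges below mean:
# count <= 1 -> 'unique', 1 < count <= 10 -> 'infrequent', count > 10 -> 'frequent'.
df["frequency"] = pd.cut(
    df["count"],
    bins=[-np.inf, 1, 10, np.inf],
    labels=["unique", "infrequent", "frequent"],
)

print(df["frequency"].tolist())
# ['frequent', 'infrequent', 'unique', 'infrequent', 'frequent']
```

The new column has the ordered category dtype. If plain strings are preferred, calling .astype(str) on the result should convert it.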

The disadvantage of np.select in the other answer is that every condition must be evaluated before selection, so it will not scale as well as the number of conditions grows.
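For comparison, an np.select version might look roughly like the sketch below. The other answer is not reproduced here, so the condition list, the default, and the sample rows are assumptions made for illustration. Note that both boolean masks are computed in full before np.select picks a label for each row.

```python
import numpy as np
import pandas as pd

# A few counts from the question's table.
df = pd.DataFrame({"count": [67223, 7, 1, 3, 38]})

# Every condition is evaluated over the whole column up front.
conditions = [
    df["count"] == 1,
    df["count"] <= 10,
]
choices = ["unique", "infrequent"]

# For each row the first matching condition wins; rows matching none get the default.
df["frequency"] = np.select(conditions, choices, default="frequent")

print(df["frequency"].tolist())
# ['frequent', 'infrequent', 'unique', 'infrequent', 'frequent']
```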
