Performing operations on grouped rows in python

  • A+
Category:Languages

I have a dataframe where pic_code value may repeat. If it repeats, I want to set the variable "keep" to "t" for the pic_code that is closest to its mpe_wgt.

For example, the second pic_code has "keep" set to t since it has the "weight" closest to its corresponding "mpe_weight". My code results in "keep" staying 'f' for all and "diff" staying "100" for all.

df['keep']='f' df['diff']=100  def cln_df(data):     if pd.unique(data['mpe_wgt']).shape==(1,):         data['keep'][0:1]='t'     elif pd.unique(data['mpe_wgt']).shape!=(1,):          data['diff']=abs(data['weight']-(data['mpe_wgt']/100))         data['keep'][data['diff']==min(data['diff'])]='t'     return data  df=df.groupby('pic_code').apply(cln_df) 

df before

  pic_code      weight      mpe_wgt    keep    diff   1234          45          34         f       100   1234          32          23         f       100   45344         54          35         f       100   234           76          98         f       100   234           65          12         f       100 

df output should be

  pic_code      weight      mpe_wgt    keep    diff   1234          45          34         f       11   1234          32          23         t       9   45344         54          35         t       100   234           76          98         t       22   234           65          12         f       53 

I'm fairly new to python so please keep the solutions as simple as possible. I really want to make my method work so please don't get too fancy. Thanks in advance for your help.


This is one way. Note I am using Boolean values True / False in place of strings "t" and "f". This is just good practice.

Note that all the below operations are vectorised, while groupby.apply with a custom function certainly is not.

Setup

print(df)     pic_code  weight  mpe_wgt 0      1234      45       34 1      1234      32       23 2     45344      54       35 3       234      76       98 4       234      65       12 

Solution

# calculate difference df['diff'] = (df['weight'] - df['mpe_wgt']).abs()  # sort by pic_code, then by diff df = df.sort_values(['pic_code', 'diff'])  # define keep column as True only for non-duplicates by pic_code df['keep'] = ~df.duplicated('pic_code') 

Result

print(df)     pic_code  weight  mpe_wgt  diff   keep 3       234      76       98    22   True 4       234      65       12    53  False 1      1234      32       23     9   True 0      1234      45       34    11  False 2     45344      54       35    19   True 

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: