Faster alternative to perform pandas groupby operation


I have a dataset with the columns name (person_name), day, and color (shirt_color). Each person wears a shirt of a certain color on a particular day (the number of days can be arbitrary).

eg input:

    name   day  color
    -----------------
    John   1    White
    John   2    White
    John   3    Blue
    John   4    Blue
    John   5    White
    Tom    2    White
    Tom    3    Blue
    Tom    4    Blue
    Tom    5    Black
    Jerry  1    Black
    Jerry  2    Black
    Jerry  4    Black
    Jerry  5    White

I need to find the color most frequently worn by each person, e.g. result:

    name    color
    -------------
    Jerry   Black
    John    White
    Tom     Blue

I am performing the following operation to get the results, which works fine but is quite slow:

    most_frquent_list = [[name, group.color.mode()[0]]
                         for name, group in data.groupby('name')]
    most_frquent_df = pd.DataFrame(most_frquent_list, columns=['name', 'color'])

Now suppose I have a dataset with 5 million unique names. What is the best/fastest way to perform the above operation?

 


Solution from pd.Series.mode

    df.groupby('name').color.apply(pd.Series.mode).reset_index(level=1, drop=True)
    Out[281]:
    name
    Jerry    Black
    John     White
    Tom       Blue
    Name: color, dtype: object
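Since apply still calls a Python function once per name, it may remain slow with millions of groups. A possible alternative, offered only as a sketch and not benchmarked here, is to count (name, color) pairs in a single grouped pass and then keep the top-count row per name. The column name n and the variables counts and most_frequent_df are illustrative choices, and the sample frame is simply the input from the question.

```python
import pandas as pd

# Sample data copied from the question.
data = pd.DataFrame({
    'name':  ['John'] * 5 + ['Tom'] * 4 + ['Jerry'] * 4,
    'day':   [1, 2, 3, 4, 5, 2, 3, 4, 5, 1, 2, 4, 5],
    'color': ['White', 'White', 'Blue', 'Blue', 'White',
              'White', 'Blue', 'Blue', 'Black',
              'Black', 'Black', 'Black', 'White'],
})

# Count how often each (name, color) pair occurs in one grouped pass.
counts = data.groupby(['name', 'color']).size().reset_index(name='n')

# Sort so that, within each name, the highest count comes first;
# ties are broken alphabetically by color, matching mode()[0].
counts = counts.sort_values(['name', 'n', 'color'],
                            ascending=[True, False, True])

# Keep only the first (most frequent) row per name.
most_frequent_df = (counts.drop_duplicates('name')[['name', 'color']]
                          .reset_index(drop=True))

print(most_frequent_df)
```

Whether this is actually faster depends on the data, so it is worth timing both versions on a realistic sample before switching.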
