I have a dataset with name (person_name), day and color (shirt_color) as columns. Each person wears a shirt of a certain color on a particular day (the number of days can be arbitrary).
    name   day  color
    ------------------
    John   1    White
    John   2    White
    John   3    Blue
    John   4    Blue
    John   5    White
    Tom    2    White
    Tom    3    Blue
    Tom    4    Blue
    Tom    5    Black
    Jerry  1    Black
    Jerry  2    Black
    Jerry  4    Black
    Jerry  5    White
I need to find the color each person wears most frequently, e.g. the result should be:
    name   color
    -------------
    Jerry  Black
    John   White
    Tom    Blue
I am performing the following operation to get the results, which works fine but is quite slow:
    most_frquent_list = [[name, group.color.mode()] for name, group in data.groupby('name')]
    most_frquent_df = pd.DataFrame(most_frquent_list, columns=['name', 'color'])
Now suppose I have a dataset with 5 million unique names. What is the best/fastest way to perform the above operation?
    df.groupby('name').color.apply(pd.Series.mode).reset_index(level=1, drop=True)
    Out:
    name
    Jerry    Black
    John     White
    Tom       Blue
    Name: color, dtype: object
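If `apply` is still too slow with millions of groups, one alternative worth trying is to count each (name, color) pair once and then keep the top row per name, which avoids calling a Python function per group. The following is only a sketch: the helper name `most_frequent_color` and the count column `n` are made up for illustration, ties are assumed to be broken alphabetically by color, and I have not benchmarked it on data of that size.

```python
import pandas as pd

# Sample data from the question
data = pd.DataFrame({
    "name": ["John"] * 5 + ["Tom"] * 4 + ["Jerry"] * 4,
    "day": [1, 2, 3, 4, 5, 2, 3, 4, 5, 1, 2, 4, 5],
    "color": ["White", "White", "Blue", "Blue", "White",
              "White", "Blue", "Blue", "Black",
              "Black", "Black", "Black", "White"],
})

def most_frequent_color(df):
    """Return one row per name with that name's most frequent color.

    Hypothetical helper; ties are broken alphabetically by color (an assumption).
    """
    # Count occurrences of every (name, color) pair in a single pass
    counts = df.groupby(["name", "color"]).size().reset_index(name="n")
    # Within each name, put the highest count first
    counts = counts.sort_values(["name", "n", "color"],
                                ascending=[True, False, True])
    # Keep only the first (most frequent) row for each name
    return counts.drop_duplicates("name")[["name", "color"]].reset_index(drop=True)

result = most_frequent_color(data)
print(result)
```

Unlike `mode`, which returns every tied value, this keeps exactly one color per name, so check that this is what you want before swapping it in.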