Drop duplicates based on majority rule

I have a table that looks like this:

A  B
1  cat
1  cat
1  dog
2  illama
2  alpaca
3  donkey

Using A as the key, I'd like to remove duplicates so that the dataframe becomes:

A  B
1  cat
3  donkey

Key 1 appears three times; the value cat occurs most often, so it is kept. Key 2 has no majority, so it is considered ambiguous and removed completely. Key 3 remains because it has no duplicates.
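To make the rule concrete, here is a minimal sketch that rebuilds the sample table and counts how often each value of B occurs per key. The name `counts` is only illustrative, and the integer and string column types are an assumption about the original data.

```python
import pandas as pd

# Sample data from the question, with A as the key column.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 3],
    "B": ["cat", "cat", "dog", "illama", "alpaca", "donkey"],
})

# Count how often each value of B occurs within each key.
# A key is unambiguous when a single value has the highest count.
counts = df.groupby(["A", "B"]).size()
print(counts)
```

For key 2 both values occur once, which is the tie that makes the group ambiguous.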


groupby + pd.Series.mode

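This is a two-step solution using pd.Series.mode. The mode of a group contains every value tied for most frequent, so an ambiguous group yields more than one row: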

import pandas as pd

# find the mode(s) for each group
i = df.groupby('A').B.apply(pd.Series.mode).reset_index(level=1, drop=True)
# filter out ambiguous groups, i.e. groups which have more than one mode
j = i[i.groupby(level=0).transform('count') == 1].reset_index()

print(j)
   A       B
0  1     cat
1  3  donkey
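To see why counting the modes works, it can help to look at the intermediate values. The sketch below is self-contained and repeats the two steps on the sample data; the name `n_modes` is introduced here purely for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 3],
    "B": ["cat", "cat", "dog", "illama", "alpaca", "donkey"],
})

# Step 1: one row per mode, indexed by A. A tied group contributes several rows.
i = df.groupby("A").B.apply(pd.Series.mode).reset_index(level=1, drop=True)

# Number of modes found for each key, aligned row by row with i.
n_modes = i.groupby(level=0).transform("count")

# Step 2: keep only the keys that have exactly one mode.
j = i[n_modes == 1].reset_index()
print(j)
```

Key 2 shows up twice in `i`, once per tied value, so its count is 2 and both rows are filtered out.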

groupby + <custom func>

Alternatively, define a custom function that computes the mode and call it with apply. The filtering logic moves into the function: it returns the mode only when there is exactly one, and returns nothing for an ambiguous group.

def foo(x):
    m = pd.Series.mode(x)
    # return the mode only when it is unique; ambiguous groups return None
    if len(m) == 1:
        return m

df.groupby('A').B.apply(foo).reset_index(level=1, drop=True).reset_index()

   A       B
0  1     cat
1  3  donkey
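A possible variant, not part of the answer above, is to return a scalar per group and mark ambiguous groups with NaN, then drop them. That way the result does not depend on how apply combines groups that return None. The helper name `unique_mode` is hypothetical, and the sketch assumes B contains no value that should itself survive as NaN.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 3],
    "B": ["cat", "cat", "dog", "illama", "alpaca", "donkey"],
})

def unique_mode(x):
    """Return the single mode of x, or NaN when there is a tie (hypothetical helper)."""
    m = x.mode()
    return m.iloc[0] if len(m) == 1 else np.nan

# One scalar per key; ambiguous keys become NaN and are dropped.
result = df.groupby("A")["B"].agg(unique_mode).dropna().reset_index()
print(result)
```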
