Compare values of a dictionary and return a count of matching values


I have a dictionary that maps product names to the unique customer emails that purchased each item. It looks like this:

customer_emails = {
    'Backpack': ['customer1@gmail.com', 'customer2@gmail.com', 'customer3@yahoo.com', 'customer4@msn.com'],
    'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com', 'customer5@gmail.com'],
    'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com', 'customer4@msn.com'],
}

I am trying to iterate over the values of each key and determine how many emails also appear under the other keys. I converted this dictionary to a DataFrame and got the answer I wanted for a single-column comparison using something like this:

customers['Baseball Bat'].dropna().isin(customers['Gloves']).sum()

What I'm trying to accomplish is to create a DataFrame that essentially looks like this so that I can easily use it for correlation charts.

              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3

I'm thinking the way to do it is to iterate over the customer_emails dictionary, but I'm not sure how to pick out a single key, compare its values to all the others, and then store the result.


Start with pd.DataFrame.from_dict:

import pandas as pd

# Lists of unequal length are padded with None after the transpose.
df = pd.DataFrame.from_dict(customer_emails, orient='index').T
df

              Backpack         Baseball Bat               Gloves
0  customer1@gmail.com  customer1@gmail.com  customer2@gmail.com
1  customer2@gmail.com  customer3@yahoo.com  customer3@yahoo.com
2  customer3@yahoo.com  customer5@gmail.com    customer4@msn.com
3    customer4@msn.com                 None                 None

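Now, use stack + get_dummies + groupby/sum + dot:

# sum(level=1) was removed in pandas 2.0; groupby(level=1).sum() is the equivalent.
v = df.stack().str.get_dummies().groupby(level=1).sum()
v.dot(v.T)

              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3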
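
Why the dot product works: each row of v is a 0/1 indicator vector over the distinct emails, so entry (i, j) of v.dot(v.T) is the number of emails listed under both product i and product j, and the diagonal is each product's own customer count. A minimal self-contained sketch that cross-checks this against plain set intersections follows. The name `overlap` is only illustrative, the level-wise sum is written as groupby(level=1).sum() because that is the form current pandas accepts, and it assumes each product's emails are unique, as the question states.

```python
import pandas as pd

customer_emails = {
    'Backpack': ['customer1@gmail.com', 'customer2@gmail.com',
                 'customer3@yahoo.com', 'customer4@msn.com'],
    'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com',
                     'customer5@gmail.com'],
    'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com',
               'customer4@msn.com'],
}

df = pd.DataFrame.from_dict(customer_emails, orient='index').T

# One row per product, one 0/1 column per distinct email.
v = df.stack().str.get_dummies().groupby(level=1).sum()

# Entry (i, j) counts the emails that products i and j share.
overlap = v.dot(v.T)

# Cross-check every entry against a plain set intersection.
for a in customer_emails:
    for b in customer_emails:
        expected = len(set(customer_emails[a]) & set(customer_emails[b]))
        assert overlap.loc[a, b] == expected
```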

Alternatively, replace stack with melt for some added performance:

# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent.
v = (df.melt()
       .set_index('variable')['value']
       .str.get_dummies()
       .groupby(level=0)
       .sum()
)
v.dot(v.T)

variable      Backpack  Baseball Bat  Gloves
variable
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
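
The question also asks how to do this by iterating over customer_emails directly. A rough sketch of that route, skipping get_dummies entirely, converts each list to a set and counts pairwise intersections. The names `email_sets` and `counts` are illustrative and not from the original post, and the sketch assumes the emails under each product are unique, as stated in the question.

```python
import pandas as pd

customer_emails = {
    'Backpack': ['customer1@gmail.com', 'customer2@gmail.com',
                 'customer3@yahoo.com', 'customer4@msn.com'],
    'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com',
                     'customer5@gmail.com'],
    'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com',
               'customer4@msn.com'],
}

# Convert each email list to a set so intersections are cheap.
email_sets = {product: set(emails)
              for product, emails in customer_emails.items()}

# Outer keys become columns, inner keys become the index;
# each cell is the number of emails the two products share.
counts = pd.DataFrame(
    {b: {a: len(email_sets[a] & email_sets[b]) for a in email_sets}
     for b in email_sets}
)
```

For a handful of products this is easy to read and debug; the dot-product version above is likely the better fit when there are many products or the data already lives in a DataFrame.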
