Python Pandas – difference between 'loc' and 'where'?


Just curious about the behavior of 'where' and why you would use it over 'loc'.

If I create a dataframe:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
                   'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
                   'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm']})

And then apply the 'where' function:

df2 = df.where(df['Goals']>10) 

I get the following, which keeps the rows where Goals > 10 but replaces everything else with NaN:

  Gender  Goals    ID  Run Distance
0      m   12.0   1.0         234.0
1      m   23.0   2.0          35.0
2      m   56.0   3.0          77.0
3    NaN    NaN   NaN           NaN
4    NaN    NaN   NaN           NaN
5    NaN    NaN   NaN           NaN
6    NaN    NaN   NaN           NaN
7    NaN    NaN   NaN           NaN
8    NaN    NaN   NaN           NaN
9      m   34.0  10.0         123.0

If however I use the 'loc' function:

df2 = df.loc[df['Goals']>10] 

It returns the dataframe subsetted without the NaN values:

  Gender  Goals  ID  Run Distance
0      m     12   1           234
1      m     23   2            35
2      m     56   3            77
9      m     34  10           123

So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?

 


Think of loc as a filter - give me only the parts of the df that conform to a condition.
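To make the filter idea concrete, here is a minimal sketch using the dataframe from the question. The variable names (mask, filtered, men_scoring) are only illustrative. It shows that loc with a boolean mask drops the non-matching rows, keeps the original index labels, and leaves the integer columns as integers.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
    'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
    'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm'],
})

# A boolean Series with one True/False entry per row
mask = df['Goals'] > 10

# loc keeps only the rows where the mask is True
filtered = df.loc[mask]
print(filtered.shape)        # (4, 4)
print(list(filtered.index))  # [0, 1, 2, 9]

# No NaN had to be inserted, so the integer columns stay integers
print(pd.api.types.is_integer_dtype(filtered['Goals']))  # True

# Conditions combine with & and |; each one needs its own parentheses.
# The second argument to loc picks columns by name.
men_scoring = df.loc[(df['Goals'] > 10) & (df['Gender'] == 'm'), ['ID', 'Goals']]
print(men_scoring['ID'].tolist())  # [1, 2, 3, 10]
```

Note that the surviving rows keep their original labels (0, 1, 2, 9) rather than being renumbered.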

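where is modeled on numpy's where. It runs over an array and checks whether each element fits a condition, so it gives you back an object of the same shape as the original, with either the original value or NaN in each position. That is also why the numeric columns come back as floats: NaN is a float, so an integer column that has to hold it gets upcast. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other=0), to replace values that don't meet the condition with 0 (note the integer 0; other='0' would insert the string '0' instead).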

   ID  Run Distance  Goals Gender
0   1           234     12      m
1   2            35     23      m
2   3            77     56      m
3   0             0      0      0
4   0             0      0      0
5   0             0      0      0
6   0             0      0      0
7   0             0      0      0
8   0             0      0      0
9  10           123     34      m
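As the table shows, calling where on the whole dataframe overwrites every column in the non-matching rows, including Gender. In practice it is often more useful on a single column. The sketch below, again built on the question's dataframe, is one way to see this; the names goals_floor and goals_capped are made up for illustration, and mask is the standard pandas counterpart of where that replaces values where the condition is True.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
    'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
    'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm'],
})

cond = df['Goals'] > 10

# where on the whole frame: same shape as df, NaN wherever cond is False
masked = df.where(cond)
print(masked.shape)           # (10, 4)
print(masked['Goals'].dtype)  # float64, because NaN forces an upcast

# where on one column: only Goals is touched, failing values become 0
goals_floor = df['Goals'].where(cond, other=0)
print(goals_floor.tolist())   # [12, 23, 56, 0, 0, 0, 0, 0, 0, 34]

# mask is the inverse: it replaces the values where cond is True
goals_capped = df['Goals'].mask(cond, other=10)
print(goals_capped.tolist())  # [10, 10, 10, 7, 8, 0, 4, 2, 1, 10]

# Dropping the NaN rows from where leaves the same rows that loc selects
same_rows = masked.dropna().index.equals(df.loc[cond].index)
print(same_rows)              # True
```

So where answers "keep the shape, blank out or replace what fails", while loc answers "give me only the rows that pass".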

Also, while where is only for conditional replacement, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column labels, while iloc uses integer positions. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']] (a loc slice includes its end label, so this returns rows 0 and 1):

  Gender  Goals
0      m     12
1      m     23
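The label versus position distinction matters most once the index is no longer 0, 1, 2, ..., for example after filtering. A small sketch, assuming the same dataframe as in the question (subset, by_label and by_position are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
    'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
    'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm'],
})

# After filtering, the remaining index labels are 0, 1, 2 and 9
subset = df.loc[df['Goals'] > 10]

# loc slices by label and includes the end label: labels 0, 1, 2
by_label = subset.loc[0:2, ['Gender', 'Goals']]
print(len(by_label))     # 3

# iloc slices by position and excludes the end: positions 0 and 1
col_positions = subset.columns.get_indexer(['Gender', 'Goals'])
by_position = subset.iloc[0:2, col_positions]
print(len(by_position))  # 2

# The last row has label 9 but sits at position 3
print(subset.loc[9, 'ID'])                              # 10
print(subset.iloc[3, subset.columns.get_loc('ID')])     # 10
```

Here subset.iloc[9] would raise an IndexError, because the filtered frame only has four rows even though one of them is labelled 9.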
