Delete group if NaN is present anywhere in multiple columns

  • A+
Category:Languages

I am trying to clean my dataframe such that if my "Base_2007" and "Base_2011" column contains NA, then I should completely drop that county. In my case since both Counties contains NA both of them will be dropped. Thus empty dataset will be returned. Is it possible to do something like this?

Data:

  State  Year  Base_2007  Base_2011           County 0    AL  2012        NaN       14.0  Alabama_Country 1    AL  2013       12.0       20.0  Alabama_Country 2    AL  2014       13.0        NaN  Alabama_Country 3    DC  2011        NaN       20.0          Trenton 4    DC  2012       19.0        NaN          Trenton 5    DC  2013       20.0       21.0          Trenton 6    DC  2014       25.0       30.0          Trenton 

Tail section of dataframe:

{'State': {82550: 'WY', 82551: 'WY', 82552: 'WY', 82553: 'WY', 82554: 'WY', 82555: 'WY', 82556: 'WY', 82557: 'WY', 82558: 'WY', 82559: 'WY'}, 'County': {82550: 'Weston', 82551: 'Weston', 82552: 'Weston', 82553: 'Weston', 82554: 'Weston', 82555: 'Weston', 82556: 'Weston', 82557: 'Weston', 82558: 'Weston', 82559: 'Weston'}, 'FIPS code': {82550: 56045, 82551: 56045, 82552: 56045, 82553: 56045, 82554: 56045, 82555: 56045, 82556: 56045, 82557: 56045, 82558: 56045, 82559: 56045}, 'Year': {82550: 2008, 82551: 2009, 82552: 2010, 82553: 2011, 82554: 2012, 82555: 2013, 82556: 2014, 82557: 2015, 82558: 2016, 82559: 2017}, 'Annual_pct_change': {82550: 6.52, 82551: -2.93, 82552: -5.61, 82553: 1.9, 82554: 5.16, 82555: -4.03, 82556: 7.69, 82557: -2.35, 82558: 1.67, 82559: 5.56}, 'HPI': {82550: 195.73, 82551: 189.99, 82552: 179.33, 82553: 182.73, 82554: 192.15, 82555: 184.4, 82556: 198.58, 82557: 193.9, 82558: 197.14, 82559: 208.11}, 'HPI1990': {82550: nan, 82551: nan, 82552: nan, 82553: nan, 82554: nan, 82555: nan, 82556: nan, 82557: nan, 82558: nan, 82559: nan}, 'HPI2000': {82550: 190.09, 82551: 184.51, 82552: 174.16, 82553: 177.46, 82554: 186.61, 82555: 179.08, 82556: 192.86, 82557: 188.31, 82558: 191.46, 82559: 202.11}, 'CountyName': {82550: 'Weston County', 82551: 'Weston County', 82552: 'Weston County', 82553: 'Weston County', 82554: 'Weston County', 82555: 'Weston County', 82556: 'Weston County', 82557: 'Weston County', 82558: 'Weston County', 82559: 'Weston County'}} 

Head Section of DataFrame:

{'State': {0: 'AL', 1: 'AL', 2: 'AL', 3: 'AL', 4: 'AL', 5: 'AL', 6: 'AL', 7: 'AL', 8: 'AL', 9: 'AL'}, 'County': {0: 'Autauga', 1: 'Autauga', 2: 'Autauga', 3: 'Autauga', 4: 'Autauga', 5: 'Autauga', 6: 'Autauga', 7: 'Autauga', 8: 'Autauga', 9: 'Autauga'}, 'FIPS code': {0: 1001, 1: 1001, 2: 1001, 3: 1001, 4: 1001, 5: 1001, 6: 1001, 7: 1001, 8: 1001, 9: 1001}, 'Year': {0: 1986, 1: 1987, 2: 1988, 3: 1989, 4: 1990, 5: 1991, 6: 1992, 7: 1993, 8: 1994, 9: 1995}, 'Annual_pct_change': {0: nan, 1: -2.17, 2: 3.24, 3: 4.16, 4: -0.35, 5: 2.69, 6: 2.85, 7: 3.34, 8: 4.33, 9: 3.48}, 'HPI': {0: 100.0, 1: 97.83, 2: 100.99, 3: 105.19, 4: 104.82, 5: 107.64, 6: 110.7, 7: 114.4, 8: 119.35, 9: 123.5}, 'HPI1990': {0: 95.4, 1: 93.33, 2: 96.35, 3: 100.36, 4: 100.0, 5: 102.69, 6: 105.61, 7: 109.14, 8: 113.86, 9: 117.82}, 'HPI2000': {0: 71.03, 1: 69.49, 2: 71.74, 3: 74.72, 4: 74.45, 5: 76.46, 6: 78.63, 7: 81.26, 8: 84.77, 9: 87.72}, 'CountyName': {0: 'Autauga County', 1: 'Autauga County', 2: 'Autauga County', 3: 'Autauga County', 4: 'Autauga County', 5: 'Autauga County', 6: 'Autauga County', 7: 'Autauga County', 8: 'Autauga County', 9: 'Autauga County'}} 

Note: In above dput Base_2007= HPI1990, BASE_2011=HPI2000

 


I tested this on the below dataset (this also requires the NA to be replaced by np.nan as df = df.replace('NA', np.nan) if they are strings)

print(df)    State  Year  Base_2007  Base_2011           County 0    AL  2012        NaN       14.0  Alabama_Country 1    AL  2013       12.0       20.0  Alabama_Country 2    AL  2014       13.0        NaN  Alabama_Country 3    DC  2011        NaN       20.0          Trenton 4    DC  2012       19.0        NaN          Trenton 5    DC  2013       20.0       21.0          Trenton 6    DC  2014       25.0       30.0          Trenton 7    DM  2013       34.0       45.0            Dummy 8    DM  2012       34.0       45.0            Dummy 

Removing County containing NaN works by using :

df_new=df.loc[~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna()./                                         any(axis=1),'County'])] print(df_new)    State  Year  Base_2007  Base_2011 County 7    DM  2013       34.0       45.0  Dummy 8    DM  2012       34.0       45.0  Dummy 

I will update the explanation shortly.

Explanation

The following finds any NaN rows based on subset of Base_2007 and Base_2011

df[['Base_2007','Base_2011']].isna().any(axis=1) 0     True 1    False 2     True 3     True 4     True 5    False 6    False 7    False 8    False 

Taking the above output as a boolean mask , we call the df.loc[] function as:

df.loc[df[['Base_2007','Base_2011']].isna().any(axis=1),'County'] 

which gives:

0    Alabama_Country 2    Alabama_Country 3            Trenton 4            Trenton 

Note we are taking only the County column under **df.loc[]**. the reason for this is the next step.

We take the output above and find if any of cells in County column from the original dataframe exists in the output we got above by using s.isin()

This returns True for rows in County which is present in the output of df.loc[].

Then we negate them by an invert ~ which turns all True to False and vice versa.

~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().any(axis=1),'County']) 0    False 1    False 2    False 3    False 4    False 5    False 6    False 7     True 8     True 

Once we have this ready, we apply the same logic as df.loc[] .

Finally we get the dataframe which returns only those County which doesnot have a NaN in Base_2007 and Base_2011.

Note : if we want the index to start from 0 and not the slice of the dataframe, we can add a reset_index(drop=True) to the end of the code as:

df_new=df.loc[~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna()./                                                      any(axis=1),'County'])].reset_index(drop=True)     State  Year  Base_2007  Base_2011 County 0    DM  2013       34.0       45.0  Dummy 1    DM  2012       34.0       45.0  Dummy 

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: