Assigning categorical values to NAs randomly or proportionally


I have a dataset:

df <- structure(list(
  gender = c("female", "male", NA, NA, "male", "male", "male"),
  Division = c("South Atlantic", "East North Central", "Pacific", "East North Central", "South Atlantic", "South Atlantic", "Pacific"),
  Median = c(57036.6262, 39917, 94060.208, 89822.1538, 107683.9118, 56149.3217, 46237.265),
  first_name = c("Marilyn", "Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")
), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))

I need to run an analysis that cannot handle NA values in the gender variable. There are only a few other columns, and none has any known predictive value, so imputing the missing values from them isn't really possible.

I can perform the analysis by removing the incomplete observations entirely (they are about 4% of the dataset), but I'd also like to see the results when female or male is randomly assigned to the missing cases.

Other than writing some pretty ugly code to filter to just incomplete cases, split in two and replace NAs with female or male in each half, I wondered if there was an elegant way to randomly or proportionally assign values into NAs?

 


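We can use ifelse and is.na to find the missing values, and then use sample to randomly pick female or male for each of them. sample has to draw one value per row: with a size of 1, ifelse recycles that single draw, so every NA would receive the same value.

df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)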
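
For the proportional variant the question also asks about, one option is to weight the draw by how often each category appears among the observed values. The following is a minimal sketch under a few assumptions: it rebuilds a reduced copy of df with only two of the original columns so that it runs on its own, the names missing and observed are illustrative, and set.seed is there only to make the draw reproducible.

```r
# Sketch: fill missing gender values in proportion to the observed categories.
set.seed(42)  # assumption: fixed seed, only so the result is reproducible

# Reduced copy of the question's data (gender and first_name only)
df <- data.frame(
  gender = c("female", "male", NA, NA, "male", "male", "male"),
  first_name = c("Marilyn", "Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel"),
  stringsAsFactors = FALSE
)

missing <- is.na(df$gender)              # TRUE where gender is NA
observed <- table(df$gender[!missing])   # counts of each observed category

# One independent draw per missing row, weighted by the observed proportions
df$gender[missing] <- sample(
  names(observed),
  size = sum(missing),
  replace = TRUE,
  prob = as.numeric(observed) / sum(observed)
)
```

Only the NA rows are overwritten, so the observed values stay exactly as they were. Sampling directly from the observed values, sample(df$gender[!missing], sum(missing), replace = TRUE), should give the same distribution with less code.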
