What's an efficient method to extract only the rows with the first occurrence from a data set?


I have a data frame with patient encounters, and want to extract only the oldest encounter for each patient (which can be done using the sequential encounter ID). The code I came up with works, but I'm sure there are more efficient ways to perform this task using dplyr. What approach would you recommend?

Example with 10 encounters for 4 patients:

encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855)
gender <- c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0)
df <- data.frame(encounter_ID, patient_ID, gender)

Result (desired and obtained):

    encounter_ID    patient_ID  gender
    1003            855         0
    1022            721         0
    1013            821         1
    1002            423         1

My approach

1) Extract a list of the unique patients

list.patients <- unique(df$patient_ID) 

2) Create an empty data frame to receive our output of the first encounter per patient

one.encounter <- data.frame() 

3) Go through each patient on the list to extract their first encounter and populate our data frame

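library(dplyr)  # provides %>% and filter()
for (i in seq_along(list.patients)) {
  # all encounters of the current patient
  one.patient <- df %>% filter(patient_ID == list.patients[i])
  # sort by encounter ID so the oldest encounter comes first
  one.patient.ordered <- one.patient[order(one.patient$encounter_ID), ]
  first.encounter <- head(one.patient.ordered, n = 1)
  one.encounter <- rbind(one.encounter, first.encounter)
}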


R is generally fastest when operations are vectorized rather than run in an explicit loop. So the real question is what you mean by "more efficient": faster to run, or shorter and easier to read?

To illustrate this I show you a solution in base R and run a microbenchmark:

microbenchmark::microbenchmark(myfun1(), myfun2(), myfun3())
Unit: microseconds
     expr    min      lq     mean  median     uq     max neval
 myfun1() 3997.1 4416.10 6086.848 5129.65 6215.6 64014.4   100
 myfun2()  834.7  993.50 1404.901 1083.95 1247.5 20456.2   100
 myfun3()  133.3  162.75  258.533  193.75  233.8  3561.7   100

Your solution is myfun1(), @SmitM's dplyr version is myfun2(), and my solution (myfun3()) looks like this:

df_ordered <- df[order(df$patient_ID, df$encounter_ID), ]
df_ordered[match(unique(df_ordered$patient_ID), df_ordered$patient_ID), ]
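
To make the two steps more explicit, here is a small sketch using the example data from the question. The names first_pos, via_match and via_duplicated are only illustrative, and the `!duplicated()` variant is an alternative I am adding for comparison, not part of the benchmark above.

```r
# Example data from the question
df <- data.frame(
  encounter_ID = c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077),
  patient_ID   = c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855),
  gender       = c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0)
)

# Sort by patient, then by encounter ID, so that each patient's
# lowest encounter_ID is the first row of that patient's block
df_ordered <- df[order(df$patient_ID, df$encounter_ID), ]

# match() returns the position of the FIRST hit for each unique patient_ID
first_pos <- match(unique(df_ordered$patient_ID), df_ordered$patient_ID)
via_match <- df_ordered[first_pos, ]

# Alternative: drop every row whose patient_ID was already seen further up
via_duplicated <- df_ordered[!duplicated(df_ordered$patient_ID), ]

# Both variants should select the same rows
print(all.equal(via_match, via_duplicated))
print(via_match)
```

Because the data is sorted first, "first row per patient" and "lowest encounter ID per patient" mean the same thing, which is why a plain position lookup is enough.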

Now you can choose whichever you prefer. dplyr solutions are very nice to read, and I think they can also be carried over to other programming languages. The base R solutions are very fast, but usually not as nice to read, and to the best of my knowledge they can't be carried over to other languages.
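
For reference, a dplyr formulation might look like the sketch below. This is my own assumption of how it could be written, not necessarily the version @SmitM posted, and slice_min() requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  encounter_ID = c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077),
  patient_ID   = c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855),
  gender       = c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0)
)

# Variant 1: sort by encounter ID, then keep the first row seen per patient
first_encounters <- df %>%
  arrange(encounter_ID) %>%
  distinct(patient_ID, .keep_all = TRUE)

# Variant 2: per patient, keep the row with the smallest encounter ID
first_encounters_alt <- df %>%
  group_by(patient_ID) %>%
  slice_min(encounter_ID, n = 1, with_ties = FALSE) %>%
  ungroup()

print(first_encounters)
print(first_encounters_alt)
```

Both variants return one row per patient; they differ only in row order and in whether the result is a plain data frame or a tibble.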

I posted the base R version here because it is still relatively easy to read: each function does what its name says. dplyr still looks nicer, though.
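
If you want to run a comparison like this yourself, each approach can be wrapped in a function with no arguments. The sketch below is only an illustration: loop_version() and vectorized_version() are hypothetical names, the loop uses base R subsetting as a stand-in for the code in the question, and the timing step only runs if the microbenchmark package is installed. Timings will differ from machine to machine.

```r
# Example data from the question
df <- data.frame(
  encounter_ID = c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077),
  patient_ID   = c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855),
  gender       = c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0)
)

# Hypothetical wrapper: one patient at a time, growing the result with rbind()
loop_version <- function() {
  out <- data.frame()
  for (p in unique(df$patient_ID)) {
    one <- df[df$patient_ID == p, ]
    out <- rbind(out, one[which.min(one$encounter_ID), ])
  }
  out
}

# Hypothetical wrapper: sort once, then look up the first row per patient
vectorized_version <- function() {
  o <- df[order(df$patient_ID, df$encounter_ID), ]
  o[match(unique(o$patient_ID), o$patient_ID), ]
}

# Sanity check: both approaches pick the same encounters
stopifnot(setequal(loop_version()$encounter_ID,
                   vectorized_version()$encounter_ID))

# Time both approaches if microbenchmark is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(loop_version(), vectorized_version(),
                                       times = 100))
}
```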

