# What's an efficient method to extract only the rows with the first occurrence from a data set?


I have a data frame with patient encounters, and want to extract only the oldest encounter for each patient (which can be done using the sequential encounter ID). The code I came up with works, but I'm sure there are more efficient ways to perform this task using dplyr. What approach would you recommend?

Example with 10 encounters for 4 patients:

```r
encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855)
gender <- c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0)
df <- data.frame(encounter_ID, patient_ID, gender)
```

Result (desired and obtained):

```
    encounter_ID    patient_ID  gender
    1003            855         0
    1022            721         0
    1013            821         1
    1002            423         1
```

My approach

1) Extract a list of the unique patients

```r
list.patients <- unique(df$patient_ID)
```

2) Create an empty data frame to receive our output of the first encounter per patient

```r
one.encounter <- data.frame()
```

3) Go through each patient on the list to extract their first encounter and populate our data frame

```r
for (i in 1:length(list.patients)) {
  one.patient <- df %>% filter(patient_ID == list.patients[i])
  one.patient.ordered <- one.patient[order(one.patient$encounter_ID), ]
  first.encounter <- head(one.patient.ordered, n = 1)
  one.encounter <- rbind(one.encounter, first.encounter)
}
```

R generally runs fastest when operations are vectorized. So the first question is what you mean by "more efficient": faster to run, or easier to read and write?

To illustrate, here is a solution in `base R`, together with a `microbenchmark` comparison:

```
microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3())
Unit: microseconds
     expr    min      lq     mean  median     uq     max neval
 myfun1() 3997.1 4416.10 6086.848 5129.65 6215.6 64014.4   100
 myfun2()  834.7  993.50 1404.901 1083.95 1247.5 20456.2   100
 myfun3()  133.3  162.75  258.533  193.75  233.8  3561.7   100
```

Your solution is `myfun1()`, @SmitM's `dplyr` version is `myfun2()`, and my solution (`myfun3()`) looks like this:

```r
df_ordered <- df[order(df$patient_ID, df$encounter_ID), ]
df_ordered[match(unique(df_ordered$patient_ID), df_ordered$patient_ID), ]
```
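As a side note, the `match(unique(...))` step can also be written with `duplicated()`. The sketch below rebuilds the example data from the question and checks that both spellings select the same rows; the names `first_by_match` and `first_by_dup` are only illustrative.

```r
# Same example data as in the question.
df <- data.frame(
  encounter_ID = c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077),
  patient_ID   = c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855),
  gender       = c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0)
)

# Sort so that each patient's lowest encounter_ID comes first within that patient.
df_ordered <- df[order(df$patient_ID, df$encounter_ID), ]

# match(unique(x), x) gives the position of the first occurrence of each value.
first_by_match <- df_ordered[match(unique(df_ordered$patient_ID), df_ordered$patient_ID), ]

# !duplicated(x) is TRUE exactly at those first occurrences.
first_by_dup <- df_ordered[!duplicated(df_ordered$patient_ID), ]

# Both approaches should select the same rows.
all.equal(first_by_match, first_by_dup)
```

Whether `duplicated()` is faster than `match(unique())` on large data is not something measured here; it is mainly a matter of which reads better to you.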

Now you can choose whichever you prefer. `dplyr` solutions are very readable, and I think they can also be ported to other programming languages. The `base R` solutions are very fast, but usually less readable and, to the best of my knowledge, cannot be ported to other languages.
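For readers who want to see what a `dplyr` version could look like, here is a minimal sketch. It is not necessarily the code @SmitM posted; it assumes `dplyr` 1.0.0 or later (for `slice_min()`), and the names `first_a` and `first_b` are only illustrative.

```r
library(dplyr)

# Same example data as in the question.
df <- data.frame(
  encounter_ID = c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077),
  patient_ID   = c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855),
  gender       = c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0)
)

# Option A: per patient, keep the row with the smallest encounter_ID.
first_a <- df %>%
  group_by(patient_ID) %>%
  slice_min(encounter_ID, n = 1, with_ties = FALSE) %>%
  ungroup()

# Option B: sort by encounter_ID, then keep the first row seen for each patient.
first_b <- df %>%
  arrange(encounter_ID) %>%
  distinct(patient_ID, .keep_all = TRUE)
```

Both options return one row per patient. The row order differs: option A follows the grouping by `patient_ID`, while option B follows the sort by `encounter_ID`.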

I posted the `base R` version here because it is relatively easy to read: each function does what its name says. The `dplyr` version still looks nicer, though.
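To reproduce a timing comparison like the one above, the three approaches need to be wrapped in functions. The sketch below is one way to do that, assuming the `dplyr` and `microbenchmark` packages are installed. The body of `myfun2()` is an assumed `dplyr` version, not necessarily the one that was benchmarked, and timings will differ by machine.

```r
library(dplyr)
library(microbenchmark)

# Same example data as in the question.
df <- data.frame(
  encounter_ID = c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077),
  patient_ID   = c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855),
  gender       = c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0)
)

# myfun1: the loop from the question, one filter + sort + rbind per patient.
myfun1 <- function() {
  list.patients <- unique(df$patient_ID)
  one.encounter <- data.frame()
  for (i in seq_along(list.patients)) {
    one.patient <- df %>% filter(patient_ID == list.patients[i])
    one.patient.ordered <- one.patient[order(one.patient$encounter_ID), ]
    one.encounter <- rbind(one.encounter, head(one.patient.ordered, n = 1))
  }
  one.encounter
}

# myfun2: an ASSUMED dplyr version (sort, then keep the first row per patient).
myfun2 <- function() {
  df %>%
    arrange(encounter_ID) %>%
    distinct(patient_ID, .keep_all = TRUE)
}

# myfun3: the base R version from this answer.
myfun3 <- function() {
  df_ordered <- df[order(df$patient_ID, df$encounter_ID), ]
  df_ordered[match(unique(df_ordered$patient_ID), df_ordered$patient_ID), ]
}

# Run each expression 100 times and summarize the timings.
microbenchmark(myfun1(), myfun2(), myfun3(), times = 100)
```

All three functions read `df` from the enclosing environment, which keeps the calls argument-free as in the benchmark output shown earlier.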