How to extract lines above a match using regex and R?

I would like to match a specific string in R and keep only the line above each match. Here is some example data; the real file contains hundreds of similar cases:

first_case <- data.frame(line =
               c("#John Wayne: Su, 11.01.2013 08:24:42#
                 He is present / I guess, Does great job
                 --------------------------------------------------
                 #Michal Thorn: Fr, 12.09.2015 17:23:01#
                 Works quite frequently with people
                 --------------------------------------------------
                 #Sandra Nunes: Mo, 20.05.2011 09:00:29#
                 She has some new clients"))

second_case <- data.frame(line =
                c("#Boris Jonson: Mo, 30.09.2017 09:20:42#
                  He is present
                  --------------------------------------------------
                  #Jacky Fine: Th, 02.02.2013 18:23:01#
                  Does great job
                  --------------------------------------------------
                  #Michael Bissping: Mo, 25.03.2012 10:00:29#
                  Hard to count on"))

third_case <- data.frame(line =
               c("#Isabelle Warren: Sa, 02.12.2013 02:24:42#
                 Not around / anymore
                 --------------------------------------------------
                 #Tobias Maker: Mo, 02.03.2013 10:23:01#
                 Works quite frequently with people
                 --------------------------------------------------
                 #Toe Michael : Mo, 20.05.2011 09:00:29#
                 She has some new clients & Does great job"))

all_cases <- rbind(first_case, second_case, third_case)

I am trying to keep the lines that sit one line above the string:

Does great job

My attempt matches one full line ending in a newline, followed directly by Does great job:

dplyr::filter(all_cases, grepl("((.*\n){1})Does great job", line))

Expected results:

first_case <- data.frame(line = c("#John Wayne: Su, 11.01.2013 08:24:42#"))
second_case <- data.frame(line = c("#Jacky Fine: Th, 02.02.2013 18:23:01#"))
third_case <- data.frame(line = c("#Toe Michael : Mo, 20.05.2011 09:00:29#"))

expected_result <- rbind(first_case, second_case, third_case)

1   #John Wayne: Su, 11.01.2013 08:24:42#
2   #Jacky Fine: Th, 02.02.2013 18:23:01#
3   #Toe Michael : Mo, 20.05.2011 09:00:29#

Unfortunately, my attempt returns zero rows. I would appreciate any insights!

 


You could try:

library(stringr)
library(dplyr)

all_cases %>%
  transmute(x = str_extract(line, ".*(?=\n.*?Does great job)"))

#                                         x
#1    #John Wayne: Su, 11.01.2013 08:24:42#
#2    #Jacky Fine: Th, 02.02.2013 18:23:01#
#3  #Toe Michael : Mo, 20.05.2011 09:00:29#
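To see why a lookahead pattern of this shape picks out the line above the match, here is a minimal base R sketch. It uses `regexpr(..., perl = TRUE)` rather than stringr so that it runs without extra packages, on the assumption that both regex engines treat this pattern the same way. The sample strings and the names `x`, `pattern` and `above` are made up for illustration.

```r
# Made-up sample strings mirroring the layout of the question's data
x <- c(
  "#Ann Lee: Mo, 01.01.2018 10:00:00#\n  Does great job\n  -----\n  #Bob Ray: Tu, 02.01.2018 11:00:00#\n  Hard to count on",
  "#Cid Poe: We, 03.01.2018 12:00:00#\n  He is present\n  -----\n  #Dee Fox: Th, 04.01.2018 13:00:00#\n  Often late, Does great job"
)

# ".*" consumes a single line, because the dot does not cross newlines.
# The lookahead (?=...) requires the NEXT line to contain the target,
# without making that line part of the match.
pattern <- ".*(?=\n.*?Does great job)"

m <- regexpr(pattern, x, perl = TRUE)  # position of the first match per string
above <- trimws(regmatches(x, m))      # extract it and drop the indentation
print(above)
```

Only the first match per string is returned here, which is the same limitation the single `str_extract` call has.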

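Because str_extract returns only the first match per row, an improved solution first splits each block on the dashed separators, so that each of the three people in a block is checked independently: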

library(tidyr)  # provides separate() and gather()

all_cases %>%
  separate(line, c("a", "b", "c"), sep = "-{3,}") %>%
  gather(k, v, a, b, c) %>%
  transmute(x = str_extract(v, ".*(?=\n.*?Does great job)")) %>%
  filter(!is.na(x))
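For comparison, the same "line above the match" idea can be sketched in base R without a lookahead: split each entry into lines, trim the indentation, and take the line preceding every line that contains the target. This is an alternative technique, not the approach used in the answer. The helper name `line_above` and the `demo` vector are hypothetical, invented for this illustration.

```r
# Hypothetical helper: return the line directly above every line
# that contains `target` within one multi-line string.
line_above <- function(text, target) {
  # Split the block into lines and drop leading/trailing whitespace
  lines <- trimws(strsplit(text, "\n", fixed = TRUE)[[1]])
  # Indices of the lines that contain the target string
  hits <- grep(target, lines, fixed = TRUE)
  # A hit on the very first line has nothing above it
  hits <- hits[hits > 1]
  lines[hits - 1]
}

# Made-up sample in the same layout as the question's data
demo <- c(
  "#John Wayne: Su, 11.01.2013 08:24:42#\n  He is present / I guess, Does great job\n  -----\n  #Michal Thorn: Fr, 12.09.2015 17:23:01#\n  Works quite frequently with people",
  "#Boris Jonson: Mo, 30.09.2017 09:20:42#\n  He is present\n  -----\n  #Jacky Fine: Th, 02.02.2013 18:23:01#\n  Does great job"
)

result <- unlist(lapply(demo, line_above, target = "Does great job"))
print(result)
```

On the question's data this would presumably be called as `unlist(lapply(as.character(all_cases$line), line_above, target = "Does great job"))`; the `as.character` guards against the column being a factor. Unlike a single `str_extract`, it returns every qualifying line in a block, so no prior split on the dashed separators is needed.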
