How does \G work in .split?


I like to do code-golfing in Java (even though Java is way too verbose to be competitive), which means completing a certain challenge in as few bytes as possible. In one of my answers I had the following piece of code:

for(var p:"A4;B8;CU;EM;EW;E3;G6;G9;I1;L7;NZ;O0;R2;S5".split(";")) 

This loops over the 2-char Strings after .split has converted the input into a String array. Someone suggested I could golf it to the following to save 4 bytes:

for(var p:"A4B8CUEMEWE3G6G9I1L7NZO0R2S5".split("(?<=\\G..)"))

The functionality is still the same. It loops over the 2-char Strings.

However, neither of us was 100% sure how this works, hence this question.


What I know:

I know .split("(?<= ... )") is used to split, but keep the trailing delimiter.
There is also a way to keep a leading delimiter, or to keep the delimiter as a separate item:

"a;b;c;d".split("(?<=;)")            // Results in ["a;", "b;", "c;", "d"]
"a;b;c;d".split("(?=;)")             // Results in ["a", ";b", ";c", ";d"]
"a;b;c;d".split("((?<=;)|(?=;))")    // Results in ["a", ";", "b", ";", "c", ";", "d"]

I know \G is used to stop after a non-match is encountered.
EDIT: \G is used to indicate the position where the last match ended (or the start of the string for the first run). Corrected definition thanks to @SebastianProske.

int count = 0;
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("match,");
java.util.regex.Matcher matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
  count++;
System.out.println(count); // Results in 5

count = 0;
pattern = java.util.regex.Pattern.compile("\\Gmatch,");
matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
  count++;
System.out.println(count); // Results in 3

But how does .split("(?<=\\G..)") work exactly when using \G inside the split?
And why does .split("(?=\\G..)") not work?

Here is a "Try it online" link for all the code snippets described above, to see them in action.


how does .split("(?<=\\G..)") work

(?<=X) is a zero-width positive lookbehind for X. \G is the end of the previous match (not some kind of stop instruction) or the beginning of input, and of course .. is two individual characters. So (?<=\G..) is a zero-width lookbehind for the end of the previous match plus two characters. Since this is split and we're describing a delimiter, making the entire thing a zero-width assertion means we only use it to identify where to break the string, not to actually consume any characters.

So let's walk through ABCDEF:

  1. \G matches the beginning of input, and .. matches AB, so (?<=\G..) finds the zero-width space between AB and CD because this is a lookbehind: that is, the first point at which there is \G.. prior to the regex cursor is the point between AB and CD. So split between AB and CD.
  2. \G marks the location just after AB, so (?<=\G..) finds the zero-width space between CD and EF, because as the regex cursor goes forward, that's the first place where \G.. matches: \G matching the location between AB and CD and .. matching CD. So split between CD and EF.
  3. Same again: \G marks the location just after CD, so (?<=\G..) finds the zero-width space between EF and end-of-input. So split between EF and end-of-input.
  4. Create an array with all of the pieces between the split points except the empty one at the end (because this is split with an implicit limit = 0, which discards empty strings at the end).

Result { "AB", "CD", "EF" }.
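
If you want to see those split points for yourself, here is a minimal sketch. The class name SplitWalkthrough and the helper splitPoints are names made up for illustration; the helper simply runs the same regex through Matcher.find() and records where each zero-width match lands, which is roughly what split does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitWalkthrough {
    // Hypothetical helper: collects the index of every zero-width match
    // of (?<=\G..) in the input, i.e. the points where split would cut.
    static List<Integer> splitPoints(String input) {
        Matcher m = Pattern.compile("(?<=\\G..)").matcher(input);
        List<Integer> points = new ArrayList<>();
        while (m.find()) {
            points.add(m.start());
        }
        return points;
    }

    public static void main(String[] args) {
        // Indices at which the lookbehind matches in "ABCDEF"
        System.out.println(splitPoints("ABCDEF"));
        // The array that split builds from those cut points
        System.out.println(Arrays.toString("ABCDEF".split("(?<=\\G..)")));
    }
}
```

For ABCDEF this should report cut points at indices 2, 4 and 6. The cut at 6 is the one that produces the trailing empty string, which split then drops.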

And why does .split("(?=\\G..)") not work?

Because (?=X) is a positive lookahead. The end of the previous match will never be ahead of the regex cursor. It can only be behind it.
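
To compare the two variants side by side, here is a small sketch (the class and method names are made up for illustration). My understanding, assuming Java 8 or later, is that the lookahead version can only succeed where the cursor sits exactly on \G, which happens once at index 0; split ignores a zero-width match at the very start, and after that the cursor has moved past \G, so no further cut point is found.

```java
import java.util.Arrays;

public class LookaroundContrast {
    // Lookbehind: "the previous match ended two characters behind me".
    static String[] splitBehind(String s) {
        return s.split("(?<=\\G..)");
    }

    // Lookahead: "the previous match ends right here, followed by two characters".
    static String[] splitAhead(String s) {
        return s.split("(?=\\G..)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitBehind("A4B8CU")));
        System.out.println(Arrays.toString(splitAhead("A4B8CU")));
    }
}
```

With the lookbehind the input comes back in 2-char pieces, while with the lookahead it should come back as a single unsplit element.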
