Why do regex engines allow / automatically attempt matching at the end of the input string?


Note:
* Python is used to illustrate behaviors, but this question is language-agnostic.
* For the purpose of this discussion, assume single-line input only; the presence of newlines (multi-line input) introduces variations in the behavior of $ and . that are incidental to the question at hand.

Most regex engines:

  • accept a regex that explicitly tries to match an expression after the end of the input string[1].

    $ python -c "import re; print(re.findall('$.*', 'a'))"
    ['']  # !! Matched the hypothetical empty string after the end of 'a'

  • when finding / replacing globally, i.e., when looking for all non-overlapping matches of a given regex, perform another match attempt after having reached the end of the string[2], as explained in this answer to a related question:

    $ python -c "import re; print(re.findall('.*$', 'a'))"
    ['a', '']  # !! Matched both the full input AND the hypothetical empty string

Perhaps needless to say, such match attempts succeed only if the regex in question matches the empty string (and the engine either reports zero-length matches by default or is configured to do so).
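To make that condition concrete, here is a minimal sketch comparing a pattern that cannot match the empty string with two that can. The variable names are only for illustration.

```python
import re

# A pattern that cannot match the empty string finds nothing at the end.
no_empty = re.findall(r'a', 'a')

# Patterns that can match the empty string get one extra, zero-length match
# at the end-of-input position.
star = re.findall(r'a*', 'a')
star_anchored = re.findall(r'.*$', 'a')

print(no_empty)       # ['a']
print(star)           # ['a', '']
print(star_anchored)  # ['a', '']
```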

These behaviors are at least at first glance counter-intuitive, and I wonder if someone can provide a design rationale for them, not least because:

  • it's not obvious what the benefit of this behavior is.
  • conversely, in the context of finding / replacing globally with patterns such as .* and .*$, the behavior is downright surprising.[3]
    • To ask the question more pointedly: Why does functionality designed to find multiple, non-overlapping matches of a regex - i.e., global matching - decide to even attempt another match if it knows that the entire input has been consumed already, irrespective of what the regex is (although you'll never see the symptom with a regex that doesn't at least also match the empty string)?
    • The following languages/engines exhibit the surprising behavior: .NET, Python (both 2.x and 3.x)[2], Perl (both 5.x and 6.x), Ruby, Node.js (JavaScript)

Note that regex engines vary in behavior with respect to where to continue matching after a zero-length (empty-string) match.

Either choice (start at the same character position vs. start at the next) is defensible - see the chapter on zero-length matches at www.regular-expressions.info.

By contrast, the .*$ case discussed here is different in that, with any non-empty input, the first match for .*$ is not a zero-length match, so the behavior difference does not apply - instead, the character position should advance unconditionally after the first match, which of course is impossible if you're already at the end.
Again, my surprise is at the fact that another match is attempted nonetheless, even though there's by definition nothing left.
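The behavior in question can be pictured with a hand-written global match loop. This is only a sketch of the general idea, not how any particular engine is implemented; find_all and its try_at_end flag are hypothetical names introduced here to show what changes if the end-of-input position is skipped.

```python
import re

def find_all(pattern, text, try_at_end=True):
    """Hypothetical global-match loop built on top of regex.search()."""
    regex = re.compile(pattern)
    results = []
    pos = 0
    # With try_at_end=True, position len(text) (just past the last
    # character) is still a valid place to attempt a match.
    last_pos = len(text) if try_at_end else len(text) - 1
    while pos <= last_pos:
        m = regex.search(text, pos)
        if m is None:
            break
        results.append(m.group())
        # Continue after the match; step one position forward after a
        # zero-length match so the loop cannot get stuck.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return results

print(find_all(r'.*$', 'a'))                    # ['a', '']
print(find_all(r'.*$', 'a', try_at_end=False))  # ['a']
```

With the flag off, the loop stops as soon as the input is consumed, which is the behavior the question expected.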


[1] I'm using $ as the end-of-input marker here, even though in some engines, such as .NET's, it can mark the end of the input optionally followed by a trailing newline. However, the behavior equally applies when you use the unconditional end-of-input marker, \z.

[2] Python 2.x and 3.x up to 3.6.x seemingly special-cased replacement behavior in this context: python -c "import re; print(re.sub('.*$', '[\g<0>]', 'a'))" used to yield just [a] - that is, only one match was found and replaced.
Since Python 3.7, the behavior is now like in most other regex engines, where two replacements are performed, yielding [a][].

[3] You can avoid the problem by either (a) choosing a replacement method that is designed to find at most one match or (b) using ^.* to prevent multiple matches from being found. (a) may not be an option, depending on how a given language surfaces functionality; for instance, PowerShell's -replace operator invariably replaces all occurrences. Consider the following attempt to enclose all array elements in "...": 'a', 'b' -replace '.*', '"$&"'. Due to matching twice, this yields elements "a""" and "b"""; option (b), 'a', 'b' -replace '^.*', '"$&"', fixes the problem.
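For comparison, a rough Python analogue of the two workarounds in [3]. This sketch assumes Python 3.7 or later, where re.sub() also replaces the empty match at the end; the variable names are illustrative only.

```python
import re

# Assumption: Python 3.7+, where the trailing empty match is replaced too.
doubled = re.sub(r'.*', r'"\g<0>"', 'a')           # problem: two matches
single = re.sub(r'.*', r'"\g<0>"', 'a', count=1)   # (a) at most one replacement
anchored = re.sub(r'^.*', r'"\g<0>"', 'a')         # (b) anchor at the start

print(doubled)   # "a"""
print(single)    # "a"
print(anchored)  # "a"
```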

 


What is the reason for using .* with the global modifier on? Either someone expects an empty string to be returned as a match, or they aren't aware of what the * quantifier does; otherwise the global modifier shouldn't be set. .* without g doesn't return two matches.

it's not obvious what the benefit of this behavior is.

There shouldn't be a benefit. You are actually questioning the existence of zero-length matches: you are asking why a zero-length string exists.

We have three valid places where a zero-length string exists:

  • Start of subject string
  • Between two characters
  • End of subject string
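The three kinds of position listed above can be seen by matching the empty pattern against a two-character string. This is a small illustration, not part of the original answer.

```python
import re

# The empty pattern matches at every zero-length position of 'ab':
# the start, between the two characters, and the end.
spans = [m.span() for m in re.finditer(r'', 'ab')]
print(spans)  # [(0, 0), (1, 1), (2, 2)]
```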

We should look for the reason for, rather than the benefit of, that second zero-length match produced by .* with the g modifier (or by a function that searches for all occurrences). That zero-length position following an input string has some logical uses. The state diagram below is taken from Debuggex for .*, but I added an epsilon on the direct transition from the start state to the accept state to illustrate the definition:

[Figure: state diagram for .* from Debuggex, with an added epsilon transition from the start state directly to the accept state]

That's a zero-length match (read more about epsilon transition).

This all relates to greediness and non-greediness. Without zero-length positions, a regex like .?? would have no meaning. It doesn't attempt the dot first; it skips it. It matches a zero-length string in order to move from the current state to a temporary accepting state.

Without a zero-length position, .?? could never skip a character in the input string, and that would result in a whole new flavor.

The definition of greediness / laziness leads to zero-length matches.
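A short sketch of the lazy case in Python: .?? prefers the zero-length match and only consumes a character when the rest of the match requires it. The variable names are illustrative.

```python
import re

lazy = re.match(r'.??', 'a').group()           # skips the dot: zero-length match
greedy = re.match(r'.?', 'a').group()          # tries the dot first
forced = re.fullmatch(r'.??', 'a').group()     # must cover the whole input

print(repr(lazy))    # ''
print(repr(greedy))  # 'a'
print(repr(forced))  # 'a'
```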
