“Variable length lookbehind not implemented” but it isn't variable length

I have a very crazy regex that I'm trying to diagnose. It is also very long, but I have cut it down to just the following script. Run using Strawberry Perl v5.26.2.

    use strict;
    use warnings;

    my $text  = "M Y H A P P Y T E X T";
    my $regex = '(?i)(?<!(Mon|Fri|Sun)day |August )abcd(?-i)';

    if ($text =~ m/$regex/) {
        print "true\n";
    } else {
        print "false\n";
    }

This gives the error "Variable length lookbehind not implemented in regex."

I am hoping you can help with two issues:

  1. I don't see why this error would occur, because all of the possible lookbehind values are 7 characters: "Monday ", "Friday ", "Sunday ", "August ".
  2. I did not write this regex myself, and I am not sure how to interpret the syntax (?i) and (?-i). When I get rid of the (?i), the error actually goes away. How will Perl interpret this part of the regex? I would have thought the first two characters mean "optional literal parenthesis", except that the parenthesis isn't escaped, and in that case I would also get a different syntax error because the closing parenthesis would be unmatched.

I have reduced your problem to this:

    my $text  = 'M Y H A P P Y T E X T';
    my $regex = '(?<!st)A';
    print($text =~ m/$regex/i ? "true\n" : "false\n");

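The error comes from the /i (case-insensitive) modifier combined with certain character sequences such as "ss" or "st". Under case-insensitive matching, each of these two-character sequences can also match a single character, a typographic ligature such as "ß" for "ss" or U+FB06 (LATIN SMALL LIGATURE ST) for "st". The lookbehind may therefore have to look back either one or two characters, which makes it variable length. In your regex the culprit is the "st" in "August ".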

If we remove the /i (case-insensitive) modifier, it works, because without case folding the typographic ligatures are no longer matched.
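To see the folding on its own, outside any lookbehind, here is a minimal sketch. It assumes Perl 5.14 or later, and the variable names are only illustrative. It checks whether the two-character pattern "st" matches the single character U+FB06 with and without /i.

```perl
use strict;
use warnings;

# U+FB06 is LATIN SMALL LIGATURE ST: one character, not two.
my $ligature = "\x{FB06}";

# With /i, Unicode case folding lets the two-character pattern "st"
# match the one-character ligature.
my $with_i = $ligature =~ /st/i ? 1 : 0;

# Without /i there is no folding, so the ligature does not match.
my $without_i = $ligature =~ /st/ ? 1 : 0;

print "with /i: $with_i, without /i: $without_i\n";
```

Because "st" can consume either one or two characters of input under /i, the regex compiler cannot give the lookbehind a single fixed length.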

Solution: add the aa modifier alongside i, i.e.:

/(?<!st)A/iaa 

Or in your regex:

    my $text  = 'M Y H A P P Y T E X T';
    my $regex = '(?<!(Mon|Fri|Sun)day |August )abcd';
    print($text =~ m/$regex/iaa ? "true\n" : "false\n");
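On the second part of the question: (?i) and (?-i) are inline modifiers, not optional parentheses. (?i) switches case-insensitivity on from that point to the end of the enclosing group, and (?-i) switches it off again. The sketch below assumes Perl 5.14 or later, and the test strings are made up for illustration. It shows that behaviour, and also shows that the aa modifier can be written inline as (?iaa), which may be convenient when the pattern lives in a string as yours does.

```perl
use strict;
use warnings;

# Only "bc" is matched case-insensitively: "a" comes before (?i)
# and "d" comes after (?-i).
my $partial = 'a(?i)bc(?-i)d';

my $r1 = 'aBCd' =~ /$partial/ ? 1 : 0;   # "BC" is allowed in any case
my $r2 = 'ABCd' =~ /$partial/ ? 1 : 0;   # "A" fails: "a" is case-sensitive
my $r3 = 'aBCD' =~ /$partial/ ? 1 : 0;   # "D" fails: "d" is case-sensitive

# The regex from the question with the modifiers carried inline.
# "aa" forbids ASCII/non-ASCII folds, so "st" has a fixed length again.
my $regex = '(?iaa)(?<!(Mon|Fri|Sun)day |August )abcd';

my $r4 = 'Tuesday ABCD' =~ /$regex/ ? 1 : 0;   # not preceded by a listed word
my $r5 = 'Monday ABCD'  =~ /$regex/ ? 1 : 0;   # preceded by "Monday "

print "$r1 $r2 $r3 $r4 $r5\n";
```

The trailing (?-i) in the original regex has no visible effect, since nothing follows it in the pattern.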

From perlre:

To forbid ASCII/non-ASCII matches (like "k" with "\N{KELVIN SIGN}"), specify the "a" twice, for example /aai or /aia. (The first occurrence of "a" restricts the \d, etc., and the second occurrence adds the "/i" restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.
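The perlre passage can be checked with a short sketch, again assuming Perl 5.14 or later. The code points are written as \x{...} escapes so that the script does not need charnames, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $kelvin = "\x{212A}";   # KELVIN SIGN, which case-folds to ASCII "k"

# Plain /i lets ASCII "k" match the non-ASCII Kelvin sign.
my $i_only = $kelvin =~ /k/i ? 1 : 0;

# /iaa forbids that ASCII/non-ASCII mix.
my $i_aa = $kelvin =~ /k/iaa ? 1 : 0;

# Folding between two non-ASCII characters still uses Unicode rules
# under /iaa: GREEK CAPITAL SIGMA matches GREEK SMALL SIGMA.
my $greek_aa = "\x{3A3}" =~ /\x{3C3}/iaa ? 1 : 0;

print "$i_only $i_aa $greek_aa\n";
```

This is why /aa fixes the lookbehind: "st" can no longer fold to a non-ASCII ligature, so every alternative has exactly one length.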

See a closely related discussion here
