re.sub(".*", "(replacement)", "text") doubles replacement on Python 3.7

On Python 3.7 (tested on 64-bit Windows), replacing a string using the regex .* gives the replacement string repeated twice!

On Python 3.7.2:

>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)(replacement)'

On Python 3.6.4:

>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)'

On Python 2.7.5 (32 bits):

>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)'

What is wrong, and how can I fix it?

 


This is not a bug, but the result of a bug fix in Python 3.7, introduced by commit fbb490fd2f38bd817d99c20c05121ad0168a38ee.

In regex, a non-zero-width match moves the pointer position to the end of the match, so that the next match attempt, zero-width or not, can continue from the position following the match. So in your example, after .* greedily matches and consumes the entire string, the pointer sits at the end of the string, which still leaves "room" for a zero-width match at that position. This is evident from the following code, which behaves the same in Python 2.7, 3.6 and 3.7:

>>> re.findall(".*", 'sample text')
['sample text', '']
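To see where each of those two matches sits, the spans can be printed as well. This is only a small illustrative sketch; the variable name `matches` is not from the original post.

```python
import re

# Collect (span, matched text) for every match of ".*" in the sample string.
matches = [(m.span(), m.group()) for m in re.finditer(".*", "sample text")]

# The first match covers the whole string (0..11); the second is an empty
# match sitting at the very end of the string (11..11).
print(matches)  # [((0, 11), 'sample text'), ((11, 11), '')]
```

The second entry is the zero-width match at the end of the string that the explanation above refers to.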

The bug fix concerns exactly this case: a zero-width match immediately following a non-zero-width match. Earlier versions skipped that empty match in re.sub; Python 3.7 now correctly replaces both matches with the replacement text, hence the doubled output.
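As for how to get a single replacement again, a few possible workarounds are sketched below. They are suggestions rather than anything prescribed by the commit, and which one fits depends on what the pattern is really meant to do; the variable names are illustrative.

```python
import re

text = "sample text"

# Option 1: cap the number of substitutions at one, so the trailing
# empty match is never replaced.
once = re.sub(".*", "(replacement)", text, count=1)

# Option 2: anchor the pattern to the start of the string; the empty
# match at the end can no longer satisfy \A.
anchored = re.sub(r"\A.*", "(replacement)", text)

# Option 3: require at least one character, so empty matches are
# impossible (note: an empty input string is then left unchanged).
nonempty = re.sub(".+", "(replacement)", text)

print(once)      # (replacement)
print(anchored)  # (replacement)
print(nonempty)  # (replacement)
```

Keep in mind that . does not match newlines by default, so for multi-line input these options are not equivalent: .+ would replace each line separately, while count=1 and \A only touch the first line.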
