Regex error - nothing to repeat

When I use this expression, I get an error message:

re.sub(r"([^\s\w])(\s*\1)+","\\1","...")

I checked the regex in RegExr and it returns . as expected. But when I try it in Python, I get this error message:

raise error, v # invalid expression
sre_constants.error: nothing to repeat

Can someone please explain?


It seems to be a Python bug (the same expression works perfectly in vim). The source of the problem is the (\s*...)+ bit. Basically, you can't do (\s*)+, which makes sense, because you are trying to repeat something that can be empty.

>>> re.compile(r"(\s*)+")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/re.py", line 180, in compile
    return _compile(pattern, flags)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/re.py", line 233, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat

However, (\s*\1) cannot be empty, but we only know that because we know what \1 contains. Apparently Python doesn't take that into account, which is weird.
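For what it's worth, here is a minimal sketch of what the original substitution does on an interpreter where the expression compiles (assumed here: a current Python 3). The helper name collapse_repeats is made up for illustration and is not part of the question.

```python
import re

# Pattern from the question: one non-space, non-word character, followed by
# one or more repeats of itself, each optionally preceded by whitespace.
PATTERN = re.compile(r"([^\s\w])(\s*\1)+")

def collapse_repeats(text):
    # Hypothetical helper: replace each such run with a single copy
    # of the repeated character.
    return PATTERN.sub(r"\1", text)

print(collapse_repeats("..."))
print(collapse_repeats("wait!! !now"))
```

Using a raw string for the replacement (r"\1") is equivalent to the "\\1" in the question and is a little easier to read.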

This is a Python bug in how "*" interacts with special sequences such as \w.

Instead of

re.compile(r"\w*")

Try:

re.compile(r"[a-zA-Z0-9]*")

It works, although it is not an equivalent regular expression.
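To illustrate why the two are not equivalent, here is a small sketch (assuming Python 3 and a str pattern; the sample string is made up). \w also accepts the underscore, so the character class needs a _ to get closer to it for ASCII input.

```python
import re

sample = "snake_case"  # made-up sample containing an underscore

# \w matches letters, digits and the underscore.
matches_w = bool(re.fullmatch(r"\w*", sample))
# The suggested class has no underscore, so the full match fails.
matches_class = bool(re.fullmatch(r"[a-zA-Z0-9]*", sample))
# Adding "_" makes the class accept the same ASCII sample.
matches_class_underscore = bool(re.fullmatch(r"[a-zA-Z0-9_]*", sample))

print(matches_w, matches_class, matches_class_underscore)
```

Even with the underscore added, the class is still narrower than \w for non-ASCII text, since \w in a str pattern also matches letters and digits outside ASCII.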

This bug seems to have been fixed between 2.7.5 and 2.7.6.

Actually, it's not only a Python bug with *. The same error can also occur when you pass an arbitrary string in as part of the regular expression to be compiled, like this:

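import re
input_line = "string from any input source"
# input_line is inserted into the pattern unescaped
processed_line = "text to be edited with {}".format(input_line)
target = "text to be searched"
# processed_line is interpreted as a regex, not as literal text
re.search(processed_line, target)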

This will cause an error if processed_line contains something like "(+)", which you can find in chemical formulae, for example, or in similar character sequences. The solution is to escape the string, but when you do that by hand on the fly, it is easy to get it wrong.
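One way to avoid escaping by hand is the standard library's re.escape. This is only a sketch, and the input strings below are made up for illustration:

```python
import re

input_line = "Na(+)"  # made-up input containing regex metacharacters
target = "the Na(+) ion is common"

try:
    # "(+)" is parsed as a group containing a bare "+", which is an error.
    re.search(input_line, target)
    raw_failed = False
except re.error:
    raw_failed = True

# re.escape backslash-escapes the metacharacters so the text matches literally.
match = re.search(re.escape(input_line), target)
found = match.group(0) if match else None
print(raw_failed, found)
```

Only the untrusted part should go through re.escape; any surrounding pattern you wrote yourself stays as it is.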

Beyond the bug that was discovered and fixed, I'll just note that the error message sre_constants.error: nothing to repeat is a bit confusing. I was trying to use r'?.*' as a pattern and thought it was complaining about the * for some strange reason, but the problem is actually the ?, which means "repeat zero or one times" and has nothing before it to repeat. So I needed to write r'\?.*' to match a literal ?.
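A short sketch of that case, assuming Python 3 (where the exception is exposed as re.error); the sample string is made up:

```python
import re

message = None
try:
    # The leading "?" is a quantifier with nothing in front of it.
    re.compile(r"?.*")
except re.error as exc:
    message = str(exc)

# Escaping the "?" makes it a literal question mark.
literal = re.match(r"\?.*", "?page=2")
matched_text = literal.group(0) if literal else None

print(message)
print(matched_text)
```

Recent Python versions also include the position of the offending character in the message, which helps when it is unclear which quantifier is at fault.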

In formal language theory, regular expressions use * and + as operators, so they are not treated as literal characters. I ran into the same error when executing this line:

re.split("*",text)

To fix it, put a \ before the * (the same applies to +), ideally in a raw string:

re.split(r"\*", text)
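As a rough sketch of the options for splitting on a literal asterisk (the sample text is made up): escape it by hand, let re.escape do it, or skip the regex engine entirely with str.split.

```python
import re

text = "a*b*c"  # made-up sample

parts_regex = re.split(r"\*", text)             # escaped by hand, raw string
parts_escape = re.split(re.escape("*"), text)   # escaped by the library
parts_plain = text.split("*")                   # no regex needed for a fixed separator

print(parts_regex, parts_escape, parts_plain)
```

For a fixed, literal separator, str.split is usually the simplest choice; re.split earns its keep once the separator is an actual pattern.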

I had this problem when using the regex \b?. Using \s? instead fixed the issue (although it does not mean the same thing).