Remove text between () and []

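I have a very long string with () and [] in it. I'm trying to remove the characters between the parentheses and between the square brackets, but I can't figure out how.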

The string looks something like this:

x = "This is a sentence. (once a day) [twice a day]"

This isn't the string I'm actually working with, but it is very similar and much shorter.


This should work for parentheses. A regular expression "consumes" the text it matches, so it won't work for nested parentheses.

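import re
mystring = "This is a sentence. (once a day) [twice a day]"  # sample input
regex = re.compile(r".*?\((.*?)\)")  # raw string keeps the backslashes literal
result = re.findall(regex, mystring)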

Or this would find one set of parentheses; simply loop to find more:

start = mystring.find("(")
end = mystring.find(")")
if start != -1 and end != -1:
    result = mystring[start+1:end]
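
The "loop to find more" step could look like the sketch below. The function name find_parenthesized is made up for illustration, and the sketch assumes the parentheses are not nested.

```python
def find_parenthesized(text):
    """Collect the text inside each non-nested (...) pair, left to right."""
    results = []
    pos = 0
    while True:
        start = text.find("(", pos)
        if start == -1:
            break  # no more opening parentheses
        end = text.find(")", start + 1)
        if end == -1:
            break  # unmatched "(": stop rather than guess
        results.append(text[start + 1:end])
        pos = end + 1  # continue searching after this pair
    return results


mystring = "This is a sentence. (once a day) and (twice a day)"
print(find_parenthesized(mystring))
# -> ['once a day', 'twice a day']
```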

Run this script; it works even with nested brackets.
It uses basic logical tests.

def a(test_str):
    ret = ''
    skip1c = 0
    skip2c = 0
    for i in test_str:
        if i == '[':
            skip1c += 1
        elif i == '(':
            skip2c += 1
        elif i == ']' and skip1c > 0:
            skip1c -= 1
        elif i == ')' and skip2c > 0:
            skip2c -= 1
        elif skip1c == 0 and skip2c == 0:
            ret += i
    return ret


x = "ewq[a [(b] ([c))]] This is a sentence. (once a day) [twice a day]"
x = a(x)
print(x)
print(repr(x))

In case you don't run it,
here's the output:

>>>
ewq This is a sentence.
'ewq This is a sentence.  '

You can use the re.sub function.

>>> import re
>>> x = "This is a sentence. (once a day) [twice a day]"
>>> re.sub(r"([\(\[]).*?([\)\]])", r"\g<1>\g<2>", x)
'This is a sentence. () []'

If you want to remove the [] and the () as well, you can use this code:

>>> import re
>>> x = "This is a sentence. (once a day) [twice a day]"
>>> re.sub(r"[\(\[].*?[\)\]]", "", x)
'This is a sentence.  '

Important: this code will not work with nested symbols.

Explanation

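The first regex captures ( or [ as group 1 (by wrapping the character class in parentheses) and ) or ] as group 2, and matches those groups together with every character between them. After matching, the matched portion is replaced with groups 1 and 2, so the final string keeps the brackets but has nothing inside them. The second regex follows from this: it matches each bracketed span, brackets included, and replaces it with the empty string.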

Modified from comments by Ajay Thomas
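
One possible way around the nested-symbols limitation noted above, while staying with re, is to delete innermost pairs repeatedly until nothing changes. This is only a sketch and not part of the answer above; the names INNERMOST and strip_nested are invented here, and it assumes the brackets are balanced.

```python
import re

# Matches a pair that contains no bracket of its own kind,
# i.e. an innermost "(...)" or "[...]".
INNERMOST = re.compile(r"\([^()]*\)|\[[^\[\]]*\]")


def strip_nested(text):
    """Repeatedly delete innermost pairs until no pair is left."""
    while True:
        text, n = INNERMOST.subn("", text)
        if n == 0:
            return text


print(repr(strip_nested("a (b (c) d) e [f [g] h]")))
# -> 'a  e '
```

Each pass peels off one nesting level, so the loop runs once per level plus a final pass that finds nothing to remove.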

Here's a solution similar to @Pradyung's answer (it works with arbitrarily nested brackets):

def remove_text_inside_brackets(text, brackets="()[]"):
    count = [0] * (len(brackets) // 2) # count open/close brackets
    saved_chars = []
    for character in text:
        for i, b in enumerate(brackets):
            if character == b: # found bracket
                kind, is_close = divmod(i, 2)
                count[kind] += (-1)**is_close # `+1`: open, `-1`: close
                if count[kind] < 0: # unbalanced bracket
                    count[kind] = 0  # keep it
                else:  # found bracket to remove
                    break
        else: # character is not a [balanced] bracket
            if not any(count): # outside brackets
                saved_chars.append(character)
    return ''.join(saved_chars)


print(repr(remove_text_inside_brackets(
    "This is a sentence. (once a day) [twice a day]")))
# -> 'This is a sentence.  '

You can split, filter, and join the string again.

import re
x = "".join(re.split(r"\(|\)|\[|\]", x)[::2])
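
To see why taking every second element works, it may help to look at the intermediate list. The sketch below uses the string from the question and assumes the brackets are well formed and not nested, so that pieces outside and inside brackets strictly alternate. The names parts and kept are only illustrative.

```python
import re

x = "This is a sentence. (once a day) [twice a day]"

# Splitting on every bracket character alternates between
# text outside brackets (even indexes) and text inside (odd indexes).
parts = re.split(r"\(|\)|\[|\]", x)
print(parts)
# -> ['This is a sentence. ', 'once a day', ' ', 'twice a day', '']

kept = parts[::2]  # keep only the pieces that were outside brackets
print(repr("".join(kept)))
# -> 'This is a sentence.  '
```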

You can try this. It removes the brackets as well as the content inside them.

import re
x = "This is a sentence. (once a day) [twice a day]"
x = re.sub(r"\(.*?\)|\[.*?\]", "", x)
print(x)

Expected output:

This is a sentence.

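The regex \(.*?\)|\[.*?\] removes bracketed content by finding pairs: at each position it tries parentheses first, then square brackets. It also copes with one kind of bracket nested inside the other, as in ([1], [2]), because the outer pair is matched as a whole. It does not handle a pair nested inside a pair of the same kind, since the non-greedy match stops at the first closing bracket, and it will of course also break on badly formed brackets.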
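
A quick check of how this pattern behaves on nested input, as a sketch with made-up sample strings:

```python
import re

pattern = re.compile(r"\(.*?\)|\[.*?\]")

samples = [
    "cancer ([1], [2]) risk",            # square brackets inside parentheses
    "take (once (or twice) a day) now",  # parentheses inside parentheses
]
for s in samples:
    print(repr(pattern.sub("", s)))
# -> 'cancer  risk'
# -> 'take  a day) now'
```

In the second sample the non-greedy match ends at the first ")", so part of the outer pair is left behind.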


import re

_brackets = re.compile(r"\(.*?\)|\[.*?\]")
_spaces = re.compile(r"\s+")

# Replace each bracketed span with a space, then collapse runs of whitespace.
_b = _brackets.sub(" ", "microRNAs (miR) play a role in cancer ([1], [2])")
_s = _spaces.sub(" ", _b.strip())
print(_s)

# OUTPUT: microRNAs play a role in cancer


For those who appreciate the simplicity of the answer given by jvallver and want more readability from their code:

>>> import re
>>> x = 'This is a sentence. (once a day) [twice a day]'
>>> opening_braces = r'\(\['
>>> closing_braces = r'\)\]'
>>> non_greedy_wildcard = '.*?'
>>> re.sub(rf'[{opening_braces}]{non_greedy_wildcard}[{closing_braces}]', '', x)
'This is a sentence.  '

Most of the explanation for why this regex works is contained in the code itself. Your future self will thank you for the 3 extra lines.

(For Python 2 compatibility, replace the f-string with the equivalent string concatenation.)
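
A concatenation-based version might look like the sketch below. It builds the same pattern without an f-string; the variable name pattern is introduced here only for illustration.

```python
import re

x = 'This is a sentence. (once a day) [twice a day]'
opening_braces = r'\(\['
closing_braces = r'\)\]'
non_greedy_wildcard = '.*?'

# Plain concatenation produces the same pattern as the f-string version.
pattern = '[' + opening_braces + ']' + non_greedy_wildcard + '[' + closing_braces + ']'
print(repr(re.sub(pattern, '', x)))
# -> 'This is a sentence.  '
```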