Regular expression negative lookahead

In my home directory I have a folder drupal-6.14 that contains the Drupal platform.

From this directory I use the following command:

find drupal-6.14 -type f -iname '*' | grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*' | xargs tar -czf drupal-6.14.tar.gz

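What this command does is gzip the folder drupal-6.14, excluding all subfolders of drupal-6.14/sites/ except sites/all and sites/default, which it includes.

My question is about the regular expression: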

grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*'

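The expression works to exclude all the folders I want excluded, but I don't quite understand why.

It is a common task with regular expressions to match all strings except those that contain subpattern x. In other words, to negate a subpattern.

I (think I) understand that the general strategy for solving these problems is to use negative lookaheads, but I have never understood to a satisfactory level how positive and negative lookaheads and lookbehinds work.

Over the years I have read many websites about them: the PHP and Python regex manuals, other pages such as http://www.regular-expressions.info/lookaround.html, and so on, but I have never really had a solid understanding of them.

Could someone explain how this works, and perhaps provide some similar examples that do similar things?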

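-- Update One:

Regarding Andomar's response: can a double negative lookahead be expressed more succinctly as a single positive lookahead?

That is, is: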

'drupal-6.14/(?!sites(?!/all|/default)).*'

equivalent to:

'drupal-6.14/(?=sites(?:/all|/default)).*'

???

-- Update Two:

As per @Andomar and @Alan Moore, you can't interchange a double negative lookahead for a positive lookahead.


Lookarounds can be nested.

So this regex matches "drupal-6.14/" that is not followed by "sites" that is not followed by "/all" or "/default".

Confusing? In other words, it matches "drupal-6.14/" that is not followed by "sites", unless that "sites" is in turn followed by "/all" or "/default".

A negative lookahead says: at this position, the following regex must not match.

Let's take a simplified example:

a(?!b(?!c))


a      Match: (?!b) succeeds
ac     Match: (?!b) succeeds
ab     No match: (?!b(?!c)) fails
abe    No match: (?!b(?!c)) fails
abc    Match: (?!b(?!c)) succeeds

The last example is a double negation: it allows b when it is followed by c. In that case the nested negative lookahead acts like a positive requirement: once a is followed by b, the c must be present.

In each example, only the a is matched. The lookahead is only a condition, and does not add to the matched text.
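The table above can be checked mechanically. The following is a minimal sketch in Python, whose `re` module supports the same lookahead syntax; the helper name `matched_text` is made up for illustration. It anchors each attempt at the start of the string and reports the matched text, which shows that only the a is consumed.

```python
import re

# The simplified pattern from the example above.
PATTERN = re.compile(r"a(?!b(?!c))")

def matched_text(s):
    """Return the text matched at the start of s, or None if there is no match."""
    m = PATTERN.match(s)
    return m.group(0) if m else None

# Same five inputs as in the table.
results = {s: matched_text(s) for s in ["a", "ac", "ab", "abe", "abc"]}

for s, text in results.items():
    print(f"{s!r:6} -> {text!r}")
```

For "ab" and "abe" the helper returns None; for the other three inputs it returns just "a", never "ac" or "abc".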

If you revise your regular expression like this:

drupal-6.14/(?=sites(?!/all|/default)).*
             ^^

...then it will match all inputs that contain drupal-6.14/ followed by sites followed by anything other than /all or /default. For example:

drupal-6.14/sites/foo
drupal-6.14/sites/bar
drupal-6.14/sitesfoo42
drupal-6.14/sitesall

Changing ?= to ?! to match your original regex simply negates those matches:

drupal-6.14/(?!sites(?!/all|/default)).*
             ^^

So, this simply means that drupal-6.14/ now cannot be followed by sites followed by anything other than /all or /default. So now, these inputs will satisfy the regex:

drupal-6.14/sites/all
drupal-6.14/sites/default
drupal-6.14/sites/all42

But, what may not be obvious from some of the other answers (and possibly your question) is that your regex will also permit other inputs where drupal-6.14/ is followed by anything other than sites as well. For example:

drupal-6.14/foo
drupal-6.14/xsites
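As a rough check, the example inputs listed in this answer can be run against both forms of the regex. This is only a sketch in Python rather than grep -P, on the assumption that the two engines treat these lookaheads the same way; the names `POSITIVE`, `NEGATIVE`, and `paths` are illustrative.

```python
import re

# Both patterns are copied as written above; the dot in "6.14" is left unescaped.
POSITIVE = re.compile(r"drupal-6.14/(?=sites(?!/all|/default)).*")
NEGATIVE = re.compile(r"drupal-6.14/(?!sites(?!/all|/default)).*")

paths = [
    # Listed above as matching the ?= version.
    "drupal-6.14/sites/foo",
    "drupal-6.14/sites/bar",
    "drupal-6.14/sitesfoo42",
    "drupal-6.14/sitesall",
    # Listed above as matching the ?! version.
    "drupal-6.14/sites/all",
    "drupal-6.14/sites/default",
    "drupal-6.14/sites/all42",
    "drupal-6.14/foo",
    "drupal-6.14/xsites",
]

positive_hits = [p for p in paths if POSITIVE.search(p)]
negative_hits = [p for p in paths if NEGATIVE.search(p)]

print("?= matches:", positive_hits)
print("?! matches:", negative_hits)
```

For these inputs the two result lists are complements of each other: every path is accepted by exactly one of the two patterns.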

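Conclusion: your regex keeps every path under drupal-6.14/ except those where drupal-6.14/ is immediately followed by sites and that sites is not in turn followed by /all or /default. In practice it includes everything in drupal-6.14 and, within sites, only the entries whose path continues with /all or /default.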
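To see why the single positive lookahead proposed in Update One is not equivalent, here is a small sketch that applies both patterns to a made-up file list. The file names are hypothetical stand-ins for the output of the find command, and Python's `re` is used in place of grep -P.

```python
import re

# The original regex and the single positive lookahead proposed in Update One.
ORIGINAL = re.compile(r"drupal-6.14/(?!sites(?!/all|/default)).*")
PROPOSED = re.compile(r"drupal-6.14/(?=sites(?:/all|/default)).*")

# Hypothetical file list standing in for `find drupal-6.14 -type f`.
files = [
    "drupal-6.14/index.php",
    "drupal-6.14/modules/node/node.module",
    "drupal-6.14/sites/all/modules/README.txt",
    "drupal-6.14/sites/default/default.settings.php",
    "drupal-6.14/sites/example.com/settings.php",
]

kept_original = [f for f in files if ORIGINAL.search(f)]
kept_proposed = [f for f in files if PROPOSED.search(f)]

print("original keeps:", kept_original)
print("proposed keeps:", kept_proposed)
```

The original pattern keeps everything except the sites/example.com entry, while the proposed pattern keeps only the two paths under sites/all and sites/default. It drops index.php and the modules file, because a positive lookahead requires sites to be present instead of merely tolerating it.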