如何使用 grep、 regex 或 perl 提取遵循模式的字符串

我有一个类似这样的文件:

    <table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>

我需要提取 name=后面的引号中的任何内容,即 content_analyzercontent_analyzer2content_analyzer_items

我是在 Linux 机器上完成这项工作的,所以使用 sed、 perl、 grep 或 bash 的解决方案都可以。

280524 次浏览

this could do it:

perl -ne 'if(m/name="(.*?)"/){ print $1 . "\n"; }'

The regular expression would be:

.+name="([^"]+)"

Then the grouping would be in the \1

If you're using Perl, download a module to parse the XML: XML::Simple, XML::Twig, or XML::LibXML. Don't re-invent the wheel.

An HTML parser should be used for this purpose rather than regular expressions. A Perl program that makes use of HTML::TreeBuilder:

Program

#!/usr/bin/env perl


use strict;
use warnings;


use HTML::TreeBuilder;


my $tree = HTML::TreeBuilder->new_from_file( \*DATA );
my @elements = $tree->look_down(
sub { defined $_[0]->attr('name') }
);


for (@elements) {
print $_->attr('name'), "\n";
}


__DATA__
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>

Output

content_analyzer
content_analyzer2
content_analyzer_items

Since you need to match content without including it in the result (must match name=" but it's not part of the desired result) some form of zero-width matching or group capturing is required. This can be done easily with the following tools:

Perl

With Perl you could use the n option to loop line by line and print the content of a capturing group if it matches:

perl -ne 'print "$1\n" if /name="(.*?)"/' filename

GNU grep

If you have an improved version of grep, such as GNU grep, you may have the -P option available. This option will enable Perl-like regex, allowing you to use \K which is a shorthand lookbehind. It will reset the match position, so anything before it is zero-width.

grep -Po 'name="\K.*?(?=")' filename

The o option makes grep print only the matched text, instead of the whole line.

Vim - Text Editor

Another way is to use a text editor directly. With Vim, one of the various ways of accomplishing this would be to delete lines without name= and then extract the content from the resulting lines:

:v/.*name="\v([^"]+).*/d|%s//\1

Standard grep

If you don't have access to these tools, for some reason, something similar could be achieved with standard grep. However, without the look around it will require some cleanup later:

grep -o 'name="[^"]*"' filename

A note about saving results

In all of the commands above the results will be sent to stdout. It's important to remember that you can always save them by piping it to a file by appending:

> result

to the end of the command.

Here's a solution using HTML tidy & xmlstarlet:

htmlstr='
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
'


echo "$htmlstr" | tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
sed '/type="global"/d' |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n

Oops, the sed command has to precede the tidy command of course:

echo "$htmlstr" |
sed '/type="global"/d' |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n

If the structure of your xml (or text in general) is fixed, the easiest way is using cut. For your specific case:

echo '<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>' | grep name= | cut -f2 -d '"'