How to use sed/grep to extract text between two words?

小开

Accepted answer

sed -e 's/Here\(.*\)String/\1/'

小开

You can use \1 (see http://www.grymoire.com/Unix/Sed.html#uh-4):

echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/g'

Whatever the parentheses match is stored as \1.
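As a minimal sketch of the same idea, assuming a sed that accepts the -E flag (GNU and BSD sed both do), the group can be written with unescaped parentheses. The function name between_here_string is made up for illustration, and the spaces around the group are part of the pattern so they are not captured.

```shell
# Hypothetical helper: print only the text captured between "Here " and " String".
# With -E, groups are written ( ) instead of \( \); \1 still refers to group 1.
between_here_string() {
    sed -E 's/Here (.*) String/\1/'
}

echo "Here is a String" | between_here_string   # prints: is a
```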

小开

You can strip the strings in Bash alone:

$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$
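The choice between the single and doubled operators matters once a marker occurs more than once. A short sketch, assuming a sample string of my own with two Here/String pairs: # and % remove the shortest match, ## and %% the longest.

```shell
foo="Here is a String, Here is another String"

# '#' strips the shortest matching prefix, '##' the longest.
echo "${foo#*Here }"     # prints: is a String, Here is another String
echo "${foo##*Here }"    # prints: is another String

# '%' strips the shortest matching suffix, '%%' the longest.
echo "${foo% String*}"   # prints: Here is a String, Here is another
echo "${foo%% String*}"  # prints: Here is a
```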

If you have a GNU grep built with PCRE support, you can use zero-width assertions:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a

小开

This might work for you (GNU sed):

sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file

This prints each occurrence of the text between the two markers (in this case Here and String) on its own line and preserves newlines within the text.
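To see the newline handling in action, here is a small sketch that wraps the same one-liner in a function (extract_multiline is a name I made up) and feeds it input where the text between the markers spans two lines. It assumes GNU sed, since it relies on \n in the replacement and on the empty regex // reusing the last regex that was applied.

```shell
# Hypothetical wrapper around the one-liner above; reads stdin instead of a file.
extract_multiline() {
    sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D'
}

# "junk" is dropped, everything up to "Here" and from "String" on is cut,
# and the line break between "is" and "a" is kept.
printf 'junk\nHere is\na String tail\n' | extract_multiline
```

With this input the output should be two lines, " is" and "a ", with the original spacing next to the markers left in place.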

小开

GNU grep also supports positive and negative look-ahead and look-behind. For your case, the command would be:

echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'

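If there are multiple occurrences of Here and string, you can choose whether to match from the first Here to the last string or to match each pair individually. In regex terms, these are called a greedy match (first case) and a non-greedy match (second case).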

$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*(?=string)' # Greedy match
is a string, and Here is another
$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*?(?=string)' # Non-greedy match (Notice the '?' after '*' in .*)
is a
is another

小开

If you have a long file with many multi-line occurrences, it is useful to number the lines first:

cat -n file | sed -n '/Here/,/String/p'
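A quick sketch of what that pipeline prints, using a made-up four-line input and a hypothetical function name number_block. One thing to keep in mind with a sed range like /Here/,/String/: the end pattern is only looked for starting on the line after the one that opened the range, so this approach suits markers that sit on different lines.

```shell
# Hypothetical helper: number every input line, then keep only the lines
# from one matching /Here/ through the next one matching /String/.
number_block() {
    cat -n | sed -n '/Here/,/String/p'
}

# Lines 2 and 3 are printed, each prefixed with its line number by cat -n.
printf 'intro\nHere is\na String\noutro\n' | number_block
```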

小开

Through GNU awk,

$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
is a
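Since the field separator is a regex that matches either marker, the same trick can be stretched to several pairs on one line. The sketch below (all_between is a name I picked) prints every even-numbered field; it assumes the markers strictly alternate Here, string, Here, string, otherwise the even fields no longer line up with the text between a pair.

```shell
# Hypothetical helper: with FS matching either marker, fields 2, 4, 6, ...
# hold the text between each Here/string pair.
all_between() {
    awk -v FS='(Here|string)' '{ for (i = 2; i <= NF; i += 2) print $i }'
}

echo 'Here is a string, and Here is another string.' | all_between
```

For this input the two printed lines are " is a " and " is another ", spaces included.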

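grep with the -P (--perl-regexp) option supports \K, which discards the characters matched so far. In our example, the previously matched string is Here, so it is dropped from the final output.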

$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
is a
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
is a

If you want the output to be is a, without the surrounding spaces, you could try the following:

$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a

小开

All of the solutions above have a deficiency when the last search string is repeated elsewhere in the string. I found it best to write a bash function.

function str_str {
    local str
    str="${1#*${2}}"
    str="${str%%$3*}"
    echo -n "$str"
}


# test it ...
mystr="this is a string"
str_str "$mystr" "this " " string"
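To show the repeated-end-marker case this is meant to handle, here is a sketch of the same idea with the markers quoted inside the expansions, so that characters such as * or ? in a marker are taken literally. The name between and the sample string are mine, not from the answer.

```shell
# Hypothetical variant of the function above.
# Usage: between STRING START_MARKER END_MARKER
between() {
    local str=$1 start=$2 end=$3
    str=${str#*"$start"}   # drop everything through the first start marker
    str=${str%%"$end"*}    # drop everything from the first remaining end marker on
    printf '%s' "$str"
}

between "this is a string and another string" "this " " string"   # prints: is a
```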

小开

The accepted answer does not remove text that may come before Here or after String. This will:

sed -e 's/.*Here\(.*\)String.*/\1/'

The main difference is the addition of .* immediately before Here and immediately after String.
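One more thing worth knowing: lines that do not contain both markers pass through s/// unchanged and still get printed. A small sketch, assuming you only want the extracted text, is to combine -n with the p flag so that only lines where the substitution succeeded are printed. The function name only_between is made up.

```shell
# Hypothetical helper: -n suppresses automatic printing, and the trailing p
# prints a line only when the substitution actually matched.
only_between() {
    sed -n 's/.*Here\(.*\)String.*/\1/p'
}

# The first line has no markers and is skipped; the second yields " is a ".
printf 'no markers on this line\nprefix Here is a String suffix\n' | only_between
```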

小开

Problem. My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject line:

Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
link in major cell growth pathway: Findings point to new potential
therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
identified [Lysosomal amino acid transporter SLC38A9 signals arginine
sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>

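Per A2 in this thread (How to use sed/grep to extract text between two words?), the first expression below "works" as long as the matched text does not include a newline: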

grep -o -P '(?<=Subject: ).*(?=molecular)' corpus/01


[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key

However, despite trying numerous variants (.+?; /s; ...), I could not get these to work:

grep -o -P '(?<=Subject: ).*(?=link)' corpus/01
grep -o -P '(?<=Subject: ).*(?=therapeutic)' corpus/01
etc.

Solution 1.

Per "Extract text between two strings on different lines":

sed -n '/Subject: /{:a;N;/Message-ID:/!ba; s/\n/ /g; s/\s\s*/ /g; s/.*Subject: \|Message-ID:.*//g;p}' corpus/01

which yields

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]
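For anyone who finds the one-liner hard to read, here is the same script spread over several lines with comments, as far as I understand it. It assumes GNU sed (for \s, \| and comment lines inside the script); the function name unfold_subject and the test message are made up.

```shell
# Hypothetical wrapper: same commands as the one-liner, reading from stdin.
unfold_subject() {
    sed -n '
    /Subject: /{
        # Keep appending the next input line until Message-ID: shows up.
        :a
        N
        /Message-ID:/!ba
        # Join the collected lines and squeeze runs of whitespace.
        s/\n/ /g
        s/\s\s*/ /g
        # Cut everything up to "Subject: " and from "Message-ID:" onward.
        s/.*Subject: \|Message-ID:.*//g
        p
    }'
}

printf 'From: x\nSubject: one\n two\n three\nMessage-ID: <id>\nBody\n' | unfold_subject
```

Note that the result keeps one trailing space, because the newline in front of Message-ID: is turned into a space before that header is cut off.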

Solution 2.*

Per "How can I replace a newline (\n) using sed?":

sed ':a;N;$!ba;s/\n/ /g' corpus/01

will replace each newline with a space.

Chaining that with A2 in How to use sed/grep to extract text between two words?, we get:

sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

which yields

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular  link in major cell growth pathway: Findings point to new potential  therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is  Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as  a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway  identified [Lysosomal amino acid transporter SLC38A9 signals arginine  sufficiency to mTORC1]]

This variant removes the double spaces:

sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

giving

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]

小开

To understand the sed command, we have to build it step by step.

Here is your original text

user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$

Let's try to remove the Here string using the substitution command in sed

user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$

At this point, I believe you would be able to remove String as well

user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$

But this is not your desired output.

To combine the two sed commands, use the -e option

user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$

Hope this helps

小开

You can use two s commands

$ echo "Here is a String" | sed 's/.*Here//; s/String.*//'
is a

This also works

$ echo "Here is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
is a


$ echo "Here is a StringHere is a StringHere is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
is a

小开

`ripgrep`

Here is an example using rg:

$ echo Here is a String | rg 'Here\s(.*)\sString' -r '$1'
is a

小开

Here is my not-so-elegant but working solution:

$ echo 'Here is a String' | sed 's/Here/\n/g'| sed 's/String/\n/g'| sed -r '/^[[:space:]]*$/d'


is a

It also works with Here is a String Here is a second String:

$ echo 'Here is a String Here is a second String' | sed 's/Here/\n/g'| sed 's/String/\n/g'| sed -r '/^[[:space:]]*$/d'


is a
is a second

Or:

$ echo 'Here is a String Here is a second String Here is last String' | sed 's/Here/\n/g'| sed 's/String/\n/g'| sed -r '/^[[:space:]]*$/d'


is a
is a second
is last
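One caveat, sketched below with a made-up input: because both markers are simply turned into line breaks, any non-blank text sitting between a String and the next Here is printed as well. The first two sed calls are merged into one here, which should behave the same on GNU sed; split_on_markers is a hypothetical name.

```shell
# Hypothetical helper: turn both markers into newlines, then let a second sed
# re-read the result line by line and drop the blank lines.
split_on_markers() {
    sed 's/Here/\n/g; s/String/\n/g' | sed -E '/^[[:space:]]*$/d'
}

# " junk " lies between a String and the next Here, yet it is printed too.
echo 'Here a String junk Here b String' | split_on_markers
```

For this input the three printed lines are " a ", " junk " and " b ".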