What does "fragment" mean in ANTLR?

I've seen both of these rules:

fragment DIGIT : '0'..'9';

and

DIGIT : '0'..'9';

Is there any difference?

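A fragment is somewhat akin to an inline function: it makes the grammar more readable and easier to maintain.

A fragment is never counted as a token; it only serves to simplify the grammar.

Consider: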

NUMBER: DIGITS | OCTAL_DIGITS | HEX_DIGITS;
fragment DIGITS: '1'..'9' '0'..'9'*;
fragment OCTAL_DIGITS: '0' '0'..'7'+;
fragment HEX_DIGITS: '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+;

In this example, the lexer always emits a single NUMBER token, regardless of whether the input matched "1234", "0xab12", or "0777".
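As a rough analogy (this is not how ANTLR is implemented), the same idea can be sketched in Python with regular expressions: the fragments are plain sub-patterns that get inlined into one NUMBER pattern, and NUMBER is the only token type that ever comes out. The helper name `token_type` is made up for this illustration.

```python
import re

# Fragments: reusable sub-patterns that never become tokens themselves.
DIGITS = r"[1-9][0-9]*"
OCTAL_DIGITS = r"0[0-7]+"
HEX_DIGITS = r"0x[0-9a-fA-F]+"

# The only real token rule; the fragments are inlined into it.
NUMBER = re.compile(rf"(?:{DIGITS}|{OCTAL_DIGITS}|{HEX_DIGITS})")


def token_type(text):
    """Return 'NUMBER' if the whole text matches the NUMBER rule, else None.

    token_type is a hypothetical helper, not part of any ANTLR API.
    """
    return "NUMBER" if NUMBER.fullmatch(text) else None


for sample in ("1234", "0xab12", "0777"):
    print(sample, token_type(sample))
```

Whichever alternative matched, the caller only ever sees NUMBER; there is no DIGITS or HEX_DIGITS token type to ask for.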

See item 3.

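According to The Definitive ANTLR 4 Reference:

Rules prefixed with fragment can be called only from other lexer rules; they are not tokens in their own right.

In practice, they improve the readability of your grammars.

Look at this example: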

STRING : '"' (ESC | ~["\\])* '"' ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;

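STRING is a lexer rule that uses fragment rules such as ESC. UNICODE is used in the ESC rule, and HEX is used in the UNICODE fragment rule. The ESC, UNICODE, and HEX rules cannot be used explicitly on their own (for example, from a parser rule).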
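The layering of STRING over ESC, UNICODE, and HEX can be sketched with Python regular expressions. This is only an analogy under the assumption that fragments behave like string substitution into the enclosing pattern; the variable names mirror the grammar but are otherwise illustrative.

```python
import re

# Fragment-like helpers: plain strings, only usable inside other patterns.
HEX = r"[0-9a-fA-F]"
UNICODE = rf"u{HEX}{{4}}"                  # 'u' followed by four HEX digits
ESC = rf'\\(?:["\\/bfnrt]|{UNICODE})'      # backslash + escape char or UNICODE

# The single token rule, built by inlining the helpers.
STRING = re.compile(rf'"(?:{ESC}|[^"\\])*"')

good = r'"line\nbreak \u00e9"'   # raw string: the backslashes are literal
bad = r'"bad \x escape"'         # \x is not an allowed escape

print(bool(STRING.fullmatch(good)))
print(bool(STRING.fullmatch(bad)))
```

Only STRING is compiled into something that can match input by itself; HEX, UNICODE, and ESC exist purely to keep the STRING definition readable.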

This blog post has a very clear example where fragment makes a significant difference:

grammar number;


number: INT;
DIGIT : '0'..'9';
INT   :  DIGIT+;

The grammar will recognize "42" but not "7". You can fix it by making DIGIT a fragment (or by moving DIGIT after INT).
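The reason is the way an ANTLR lexer picks a token: the rule with the longest match wins, and on a tie the rule defined first wins. The following is a minimal Python sketch of that selection step, not ANTLR's actual implementation; `first_token` and the rule tables are hypothetical names for illustration.

```python
import re


def first_token(rules, text):
    """Pick the rule with the longest match at the start of text.

    On a tie, the rule listed first wins, mirroring how ANTLR lexers
    disambiguate. Hypothetical helper for illustration only.
    """
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strict '>' keeps the earlier rule when lengths are equal.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name


# DIGIT is a real token rule defined before INT, as in the grammar above.
without_fragment = [("DIGIT", r"[0-9]"), ("INT", r"[0-9]+")]
# DIGIT is a fragment: it is inlined into INT and is not a token rule.
with_fragment = [("INT", r"[0-9]+")]

print(first_token(without_fragment, "42"))  # INT
print(first_token(without_fragment, "7"))   # DIGIT
print(first_token(with_fragment, "7"))      # INT
```

For "7", both DIGIT and INT match one character, so the earlier rule DIGIT wins and the parser rule `number: INT;` never sees an INT.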

The Definitive ANTLR 4 Reference (page 106):

Rules prefixed with fragment can be called only from other lexer rules; they are not tokens in their own right.


Abstract concepts:

Case 1: (if I need the RULE1, RULE2, RULE3 entities or group information)

rule0 : RULE1 | RULE2 | RULE3 ;
RULE1 : [A-C]+ ;
RULE2 : [DEF]+ ;
RULE3 : ('G'|'H'|'I')+ ;


Case 2: (if I don't care about RULE1, RULE2, RULE3 and only focus on RULE0)

RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;
// RULE0 is a terminal node.
// You can't name it 'rule0', or you will get syntax errors:
// 'A-C' came as a complete surprise to me while matching alternative
// 'DEF' came as a complete surprise to me while matching alternative


Case 3: (equivalent to Case 2, but more readable)

RULE0 : RULE1 | RULE2 | RULE3 ;
fragment RULE1 : [A-C]+ ;
fragment RULE2 : [DEF]+ ;
fragment RULE3 : ('G'|'H'|'I')+ ;
// You can't name it 'rule0', or you will get warnings:
// warning(125): implicit definition of token RULE1 in parser
// warning(125): implicit definition of token RULE2 in parser
// warning(125): implicit definition of token RULE3 in parser
// and failed to capture rule0 content (?)


What are the differences between Case 1 and Case 2/3?

  1. The lexer rules are equivalent.
  2. Each of RULE1/2/3 in Case 1 is a capturing group, similar to the regex (X).
  3. Each of RULE1/2/3 in Case 3 is a non-capturing group, similar to the regex (?:X).
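The regex analogy in the list above can be tried directly in Python. This sketch only illustrates capturing versus non-capturing groups in the `re` module; it does not involve ANTLR at all.

```python
import re

# Case 1 analogy: each sub-rule is a capturing group, so the match
# records which alternative fired.
capturing = re.compile(r"([A-C]+)|([DEF]+)|([GHI]+)")
# Case 3 analogy: the sub-rules are non-capturing; only the overall
# match survives.
non_capturing = re.compile(r"(?:[A-C]+)|(?:[DEF]+)|(?:[GHI]+)")

m1 = capturing.match("DDDDEEEEE")
print(m1.group(), m1.lastindex)   # DDDDEEEEE 2

m2 = non_capturing.match("DDDDEEEEE")
print(m2.group(), m2.lastindex)   # DDDDEEEEE None
```

Both patterns accept exactly the same text; the difference is only in whether the sub-rule that matched is still visible afterwards.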



Let's look at a concrete example.

Goal: identify [ABC]+, [DEF]+, and [GHI]+ tokens

input.txt

ABBCCCDDDDEEEEE ABCDE
FFGGHHIIJJKK FGHIJK
ABCDEFGHIJKL


Main.py

import sys
from antlr4 import *
from AlphabetLexer import AlphabetLexer
from AlphabetParser import AlphabetParser
from AlphabetListener import AlphabetListener


class MyListener(AlphabetListener):
    # Exit a parse tree produced by AlphabetParser#content.
    def exitContent(self, ctx:AlphabetParser.ContentContext):
        pass


    # (For Case1 Only) enable it when testing Case1
    # Exit a parse tree produced by AlphabetParser#rule0.
    def exitRule0(self, ctx:AlphabetParser.Rule0Context):
        print(ctx.getText())
# end-of-class


def main():
    file_name = sys.argv[1]
    input = FileStream(file_name)
    lexer = AlphabetLexer(input)
    stream = CommonTokenStream(lexer)
    parser = AlphabetParser(stream)
    tree = parser.content()
    print(tree.toStringTree(recog=parser))


    listener = MyListener()
    walker = ParseTreeWalker()
    walker.walk(listener, tree)
# end-of-def


main()


Case 1 and results:

Alphabet.g4 (Case 1)

grammar Alphabet;


content : (rule0|ANYCHAR)* EOF;


rule0 : RULE1 | RULE2 | RULE3 ;
RULE1 : [A-C]+ ;
RULE2 : [DEF]+ ;
RULE3 : ('G'|'H'|'I')+ ;


ANYCHAR : . -> skip;

Result:

# Input data (for reference)
# ABBCCCDDDDEEEEE ABCDE
# FFGGHHIIJJKK FGHIJK
# ABCDEFGHIJKL


$ python3 Main.py input.txt
(content (rule0 ABBCCC) (rule0 DDDDEEEEE) (rule0 ABC) (rule0 DE) (rule0 FF) (rule0 GGHHII) (rule0 F) (rule0 GHI) (rule0 ABC) (rule0 DEF) (rule0 GHI) <EOF>)
ABBCCC
DDDDEEEEE
ABC
DE
FF
GGHHII
F
GHI
ABC
DEF
GHI


Case 2/3 and results:

Alphabet.g4 (Case 2)

grammar Alphabet;


content : (RULE0|ANYCHAR)* EOF;


RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;


ANYCHAR : . -> skip;

Alphabet.g4 (Case 3)

grammar Alphabet;


content : (RULE0|ANYCHAR)* EOF;


RULE0 : RULE1 | RULE2 | RULE3 ;
fragment RULE1 : [A-C]+ ;
fragment RULE2 : [DEF]+ ;
fragment RULE3 : ('G'|'H'|'I')+ ;


ANYCHAR : . -> skip;

Result:

# Input data (for reference)
# ABBCCCDDDDEEEEE ABCDE
# FFGGHHIIJJKK FGHIJK
# ABCDEFGHIJKL


$ python3 Main.py input.txt
(content ABBCCC DDDDEEEEE ABC DE FF GGHHII F GHI ABC DEF GHI <EOF>)

Did you see the "capturing group" and "non-capturing group" parts?




Let's look at concrete example 2.

Goal: identify octal / decimal / hexadecimal numbers

input.txt

0
123
1~9999
001~077
0xFF, 0x01, 0xabc123


Number.g4

grammar Number;


content
: (number|ANY_CHAR)* EOF
;


number
: DECIMAL_NUMBER
| OCTAL_NUMBER
| HEXADECIMAL_NUMBER
;


DECIMAL_NUMBER
: [1-9][0-9]*
| '0'
;


OCTAL_NUMBER
: '0' '0'..'7'+
;


HEXADECIMAL_NUMBER
: '0x'[0-9A-Fa-f]+
;


ANY_CHAR
: .
;


Main.py

import sys
from antlr4 import *
from NumberLexer import NumberLexer
from NumberParser import NumberParser
from NumberListener import NumberListener


class Listener(NumberListener):
    # Exit a parse tree produced by NumberParser#Number.
    def exitNumber(self, ctx:NumberParser.NumberContext):
        print('%8s, dec: %-8s, oct: %-8s, hex: %-8s' % (ctx.getText(),
            ctx.DECIMAL_NUMBER(), ctx.OCTAL_NUMBER(), ctx.HEXADECIMAL_NUMBER()))
    # end-of-def
# end-of-class


def main():
    input = FileStream(sys.argv[1])
    lexer = NumberLexer(input)
    stream = CommonTokenStream(lexer)
    parser = NumberParser(stream)
    tree = parser.content()
    print(tree.toStringTree(recog=parser))


    listener = Listener()
    walker = ParseTreeWalker()
    walker.walk(listener, tree)
# end-of-def


main()


Result:

# Input data (for reference)
# 0
# 123
#  1~9999
#  001~077
# 0xFF, 0x01, 0xabc123


$ python3 Main.py input.txt
(content (number 0) \n (number 123) \n   (number 1) ~ (number 9999) \n   (number 001) ~ (number 077) \n (number 0xFF) ,   (number 0x01) ,   (number 0xabc123) \n <EOF>)
0, dec: 0       , oct: None    , hex: None
123, dec: 123     , oct: None    , hex: None
1, dec: 1       , oct: None    , hex: None
9999, dec: 9999    , oct: None    , hex: None
001, dec: None    , oct: 001     , hex: None
077, dec: None    , oct: 077     , hex: None
0xFF, dec: None    , oct: None    , hex: 0xFF
0x01, dec: None    , oct: None    , hex: 0x01
0xabc123, dec: None    , oct: None    , hex: 0xabc123

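If you add the fragment modifier to DECIMAL_NUMBER, OCTAL_NUMBER, and HEXADECIMAL_NUMBER, you can no longer capture the number entities (because they are no longer tokens). The result becomes: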

$ python3 Main.py input.txt
(content 0 \n 1 2 3 \n   1 ~ 9 9 9 9 \n   0 0 1 ~ 0 7 7 \n 0 x F F ,   0 x 0 1 ,   0 x a b c 1 2 3 \n <EOF>)
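Why every character ends up on its own can be reproduced with a small scanning loop in Python. This is a simplified sketch of longest-match tokenization, not ANTLR's real lexer; `tokenize` and the two rule tables are hypothetical, and the octal pattern assumes the digits 0 to 7.

```python
import re


def tokenize(rules, text):
    """Scan text left to right, taking the longest match at each position.

    Ties go to the rule listed first. Hypothetical helper for illustration.
    """
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# All number rules are real token rules.
token_rules = [
    ("DECIMAL_NUMBER", r"[1-9][0-9]*|0"),
    ("OCTAL_NUMBER", r"0[0-7]+"),          # assumption: octal digits are 0-7
    ("HEXADECIMAL_NUMBER", r"0x[0-9A-Fa-f]+"),
    ("ANY_CHAR", r"(?s)."),
]
# The number rules were turned into fragments, so ANY_CHAR is the only
# token rule left.
fragment_rules = [("ANY_CHAR", r"(?s).")]

print(tokenize(token_rules, "0x01 077"))
print(tokenize(fragment_rules, "0x01"))
```

With only ANY_CHAR left as a token rule, nothing can ever match more than one character, which is exactly the one-character-per-token tree shown above.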