What makes Java easier to parse than C?

I understand that the grammars of C and C++ are context-sensitive, and in particular that C needs a "lexer hack". On the other hand, my impression is that Java can be parsed with only two tokens of look-ahead, despite the two languages being quite similar.
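To make that concrete, here is a minimal sketch of what the "lexer hack" amounts to (the names and data structures are made up for illustration, not taken from any real compiler): the lexer has to consult the symbol table the parser is building in order to decide whether an identifier is a type name.

    #include <iostream>
    #include <set>
    #include <string>

    enum class TokenKind { Identifier, TypeName };

    // Filled in by the parser as it processes typedef declarations.
    std::set<std::string> typedef_names;

    // The feedback loop from parser to lexer is what makes C context-sensitive:
    // the same spelling lexes differently depending on earlier declarations.
    TokenKind classify(const std::string& spelling) {
        return typedef_names.count(spelling) ? TokenKind::TypeName
                                             : TokenKind::Identifier;
    }

    int main() {
        typedef_names.insert("foo");  // as if the parser just saw "typedef int foo;"
        std::cout << (classify("foo") == TokenKind::TypeName) << "\n";  // prints 1
        std::cout << (classify("a")   == TokenKind::TypeName) << "\n";  // prints 0
    }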

What would you have to change about C to make it easier to parse?

I ask because every example of C's context-sensitivity I have seen is technically allowed but awfully strange. For example,

    foo (a);

could be calling the void function foo with argument a. Or, it could be declaring a as an object of type foo, but then you could just as easily drop the parentheses. Part of the reason is that C's grammar has a "direct declarator" production rule that serves the dual purpose of declaring both functions and variables.
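To spell out the two readings (a contrived fragment; only one of the two declarations of foo can be in scope at a time):

    // Reading 1: 'foo' is a function, so "foo (a);" is a call.
    //     void foo(int);
    //     int a;
    //     foo (a);             // call foo with argument a
    //
    // Reading 2: 'foo' is a typedef, so the very same tokens declare a variable.
    typedef double foo;

    int main() {
        foo (a);                // declares 'a' of type foo; the parentheses are redundant
        a = 1.5;                // 'a' is now an ordinary variable in scope
        return 0;
    }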

The Java grammar, on the other hand, has separate production rules for variable declarations and function declarations. If you write

    foo a;

then you know it is a variable declaration, and foo can be unambiguously parsed as a type name. This might not be valid code if the class foo isn't defined somewhere in scope, but that is a job for semantic analysis, which can be performed in a later compiler pass.

I have seen it said that C is hard to parse because of typedef, but you can declare your own types in Java too. Which C grammar rules, besides direct_declarator, are at fault?
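The usual typedef illustration I have seen is pointer declarations (again a contrived fragment, in C/C++ syntax): the same tokens parse completely differently depending on whether foo names a type, whereas in Java they could only ever be an expression.

    typedef int foo;

    void f() {
        foo * bar;      // because 'foo' is a typedef, this declares 'bar'
        (void)bar;      //   as a pointer to foo
    }

    // If 'foo' and 'bar' were both variables, the same statement "foo * bar;"
    // would instead be an expression multiplying them and discarding the result.
    // In Java, "foo * bar;" can never be a declaration, so the parser needs no
    // symbol table to choose a production; what 'foo' means is left to later
    // semantic analysis.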

Parsing C++ is getting hard. Parsing Java is getting to be just as hard.

See this SO answer discussing why C (and C++) is "hard" to parse. The short summary is that C and C++ grammars are inherently ambiguous; they will give you multiple parses and you must use context to resolve the ambiguities. People then make the mistake of assuming you have to resolve ambiguities as you parse; not so, see below. If you insist on resolving ambiguities as you parse, your parser gets more complicated and that much harder to build; but that complexity is a self-inflicted wound.
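As a concrete illustration of what "use context to resolve the ambiguities" means (a contrived fragment, not from any particular front end): the grammar alone admits two parse trees for the statement below, and only the declarations in scope tell you which one is meant. A GLR parser can keep both parses and let name resolution throw one away later.

    typedef int a;        // context A: 'a' names a type
    // int a(int);        // context B: 'a' names a function (mutually exclusive)

    int use(int b) {
        return (a)(b);    // context A: a cast of b to type a
                          // context B: a call of function a with argument b
    }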

IIRC, Java 1.4's "obvious" LALR(1) grammar was not ambiguous, so it was "easy" to parse. I'm not so sure that modern Java is free of at least long-distance local ambiguities; there's always the problem of deciding whether "...>>" closes off two nested generic argument lists or is a right-shift operator. I suspect modern Java does not parse with LALR(1) anymore.
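The C++ flavor of that problem looks like the following (C++ shown because that is where the grammar had to be patched; Java generics pose the same lexical question for something like Map<String, List<Integer>>):

    #include <vector>

    int main() {
        // In C++03 the ">>" here lexed as a single right-shift token, making
        // the declaration a syntax error; C++11 added a special rule to split
        // it when it closes two template argument lists.
        std::vector<std::vector<int>> v;

        // The same two characters really are a shift operator here, so the
        // decision cannot be made by the lexer in isolation.
        int x = 256 >> 2;

        (void)v;
        (void)x;
        return 0;
    }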

But one can get past the parsing problem for both languages by using strong parsers (or weak parsers plus the context-collection hacks that C and C++ front ends mostly use now). C and C++ have the additional complication of a preprocessor, which is more complicated in practice than it looks. One claim is that C and C++ parsers are so hard they have to be written by hand. It isn't true; you can build Java and C++ parsers just fine with GLR parser generators.
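Here is one small taste of why the preprocessor makes things worse (a deliberately ugly but legal fragment): the raw source does not match the C or C++ grammar at all until macro expansion and conditional compilation have run, so a parser cannot simply ignore it.

    #include <cstdio>

    #define BEGIN_BLOCK {
    #define END_BLOCK   }

    #ifdef USE_LONG_COUNTER
    typedef long counter_t;
    #else
    typedef int counter_t;
    #endif

    // Before preprocessing, "int main() BEGIN_BLOCK" is not a valid function
    // definition; after preprocessing it is.
    int main()
    BEGIN_BLOCK
        counter_t c = 42;
        std::printf("%d\n", static_cast<int>(c));
        return 0;
    END_BLOCK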

But parsing isn't really where the problem is.

Once you parse, you will want to do something with the AST/parse tree. In practice, you need to know, for every identifier, what its definition is and where it is used ("name and type resolution"; sloppily, building symbol tables). This turns out to be a LOT more work than getting the parser right, compounded by inheritance, interfaces, overloading and templates, and confounded by the fact that the semantics for all this is written in informal natural language spread across tens to hundreds of pages of the language standard. C++ is really bad here. Java 7 and 8 are getting to be pretty awful from this point of view. (And symbol tables aren't all you need; see my bio for a longer essay on "Life After Parsing").
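To give a flavor of why this dwarfs parsing, consider a toy C++ example (the real rules run to many pages of the standard): deciding which f each call refers to already involves name hiding, a using-declaration, template instantiation and overload ranking, none of which the parser needs and all of which the symbol-table machinery must get right.

    #include <iostream>

    struct Base {
        void f(double) { std::cout << "Base::f(double)\n"; }
    };

    struct Derived : Base {
        using Base::f;                 // without this, Base::f would be hidden entirely
        void f(int) { std::cout << "Derived::f(int)\n"; }
    };

    template <typename T>
    void call(T& obj) {
        // Which f this calls depends on T's class hierarchy, name hiding,
        // using-declarations and overload ranking.
        obj.f(1);
    }

    int main() {
        Derived d;
        call(d);     // Derived::f(int): exact match wins
        d.f(2.5);    // Base::f(double): visible only via the using-declaration
        return 0;
    }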

Most folks struggle with the pure parsing part (often never finishing; check SO itself for the many, many questions about how to build working parsers for real languages), so they don't ever see life after parsing. And then we get folk theorems about what is hard to parse and no signal about what happens after that stage.

Fixing C++ syntax won't get you anywhere.

Regarding changing the C++ syntax: you'll find you need to patch a lot of places to take care of the variety of local and real ambiguities in any C++ grammar. If you insist, the following list might be a good starting place. I contend there is no point in doing this if you are not the C++ standards committee; if you did so, and built a compiler using that, nobody sane would use it. There's too much invested in existing C++ applications to switch for convenience of the guys building parsers; besides, their pain is over and existing parsers work fine.

You may want to write your own parser. OK, that's fine; just don't expect the rest of the community to let you change the language they must use to make it easier for you. They all want it easier for them, and that's to use the language as documented and implemented.