如果不使用 regexp,HTML 解析器如何工作?

我每天都会看到一些问题,问我如何解析或提取某些 HTML 字符串中的内容,第一个答案/注释总是“不要使用正则表达式解析 HTML,以免您感到愤怒!”(最后一部分有时被省略)。

这让我很困惑,我一直认为解析任何复杂字符串的最好方法是使用正则表达式。那么 HTML 解析器是如何工作的呢?它不是用正则表达式解析。

使用正则表达式的一个特殊原因是,并不总是存在解析选项(例如 JavaScript,其中 DOMDocument 不是普遍可用的选项)。例如,jQuery 似乎可以很好地使用正则表达式将 HTML 字符串转换为 DOM 节点。

不知道是否要 CW 这一点,这是一个真正的问题,我希望得到回答,并没有真正打算成为一个讨论的线索。

6530 次浏览

If you want to have a 100% solution: You need to write your own custom code that iterates through the HTML character-by-character and you need to have a tremendous amount of logic to determine if you should stop the current node and start the next.

The reason is that this is valid HTML:

<ul>
<li>One
<li>Two
<li>Three
</ul>

But so is this:

<ul>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ul>

If you are ok with "90% solution": Then using an XML parser to load a document is fine. Or using Regex (though the xml is easier if you are then master of the content).

Usually by using a tokeniser. The draft HTML5 specification has an extensive algorithm for handling "real world HTML".

Regular expressions are just one form of parser. An honest-to-goodness HTML parser will be significantly more complicated than can be expressed in regexes, using recursive descent, prediction, and several other techniques to properly interpret the text. If you really want to get into it, you might check out lex & yacc and similar tools.

The prohibition against using regexes for HTML parsing should probably be written more correctly as: "Don't use naive regular expressions to parse HTML..." (lest ye feel the wrath) "...and treat the results with caution." For certain specific goals, a regex may well be perfectly adequate, but you need to be very careful to be aware of the limitations of your regex and as cautious as is appropriate to the source of the text you're parsing (e.g., if it's user input, be very careful indeed).

So how does a HTML parser work? Doesn't it use regular expressions to parse?

Well, no.

If you reach back in your brain to a theory of computation course, if you took one, or a compilers course, or something similar, you may recall that there are different kinds of languages and computational models. I'm not qualified to go into all the details, but I can review a few of the major points with you.

The simplest type of language & computation (for these purposes) is a regular language. These can be generated with regular expressions, and recognized with finite automata. Basically, that means that "parsing" strings in these languages use state, but not auxiliary memory. HTML is certainly not a regular language. If you think about it, the list of tags can be nested arbitrarily deeply. For example, tables can contain tables, and each table can contain lots of nested tags. With regular expressions, you may be able to pick out a pair of tags, but certainly not anything arbitrarily nested.

A classic simple language that is not regular is correctly matched parentheses. Try as you might, you will never be able to build a regular expression (or finite automaton) that will always work. You need memory to keep track of the nesting depth.

A state machine with a stack for memory is the next strength of computational model. This is called a push-down automaton, and it recognizes languages generated by context-free grammars. Here, we can recognize correctly matched parentheses--indeed, a stack is the perfect memory model for it.

Well, is this good enough for HTML? Sadly, no. Maybe for super-duper carefully validated XML, actually, in which all the tags always line up perfectly. In real-world HTML, you can easily find snippets like <b><i>wow!</b></i>. This obviously doesn't nest, so in order to parse it correctly, a stack is just not powerful enough.

The next level of computation is languages generated by general grammars, and recognized by Turing machines. This is generally accepted to be effectively the strongest computational model there is--a state machine, with auxiliary memory, whose memory can be modified anywhere. This is what programming languages can do. This is the level of complexity where HTML lives.

To summarize everything here in one sentence: to parse general HTML, you need a real programming language, not a regular expression.

HTML is parsed the same way other languages are parsed: lexing and parsing. The lexing step breaks down the stream of individual characters into meaningful tokens. The parsing step assembles the tokens, using states and memory, into a logically coherent document that can be acted on.

Parsing HTML is the transformation of a linear text into a tree structure. Regular expressions cannot generally handle tree structures. The regular expression you need at each point to get the next token changes all the time. You can use regular expressions in a parser, but you will need a whole array of regular expressions for each possible state of parsing.