如何使用 Node.js 解析 HTML 页面

我需要解析(服务器端)大量的 HTML 页面。
我们都同意这里不应该使用 regexp。
在我看来,javascript 是解析 HTML 页面的本地方式,但是这种假设依赖于服务器端代码拥有 javascript 在浏览器中具有的所有 DOM 能力。

Node.js 内置了这种能力吗?
在服务器端解析 HTML 是否有更好的方法来解决这个问题?

150987 次浏览

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.JS.

Other options include:

  • BeautifulSoup for python
  • you can convert you html to xhtml and use XSLT
  • HTMLAgilityPack for .NET
  • CsQuery for .NET (my new favorite)
  • The spidermonkey and rhino JS engines have native E4X support. This may be useful, only if you convert your html to xhtml.

Out of all these options, I prefer using the Node.js option, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and server. I wish BeautifulSoup's methods were more similar to the W3C dom, and I think converting your HTML to XHTML to write XSLT is just plain sadistic.

Htmlparser2 by FB55 seems to be a good alternative.

jsdom is too strict to do any real screen scraping sort of things, but beautifulsoup doesn't choke on bad markup.

node-soupselect is a port of python's beautifulsoup into nodejs, and it works beautifully

Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, uses the jQuery selectors you already know.

❤ Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.

ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.

❁ Insanely flexible: Cheerio wraps around @FB55's forgiving htmlparser. Cheerio can parse nearly any HTML or XML document.

Use htmlparser2, its way faster and pretty straightforward. Consult this usage example:

https://www.npmjs.org/package/htmlparser2#usage

And the live demo here:

http://demos.forbeslindesay.co.uk/htmlparser2/

November 2020 Update

I searched for the top NodeJS html parser libraries.

Because my use cases didn't require a library with many features, I could focus on stability and performance.

By stability I mean that I want the library to be used long enough by the community in order to find bugs and that it will be still maintained and that open issues will be closed.

Its hard to understand the future of an open source library, but I did a small summary based on the top 10 libraries in openbase.

I divided into 2 groups according to the last commit (and on each group the order is according to Github starts):

Last commit is in the last 6 months:

jsdom - Last commit: 3 Months, Open issues: 331, Github stars: 14.9K.

htmlparser2 - Last commit: 8 days, Open issues: 2, Github stars: 2.7K.

parse5 - Last commit: 2 Months, Open issues: 21, Github stars: 2.5K.

swagger-parser - Last commit: 2 Months, Open issues: 48, Github stars: 663.

html-parse-stringify - Last commit: 4 Months, Open issues: 3, Github stars: 215.

node-html-parser - Last commit: 7 days, Open issues: 15, Github stars: 205.

Last commit is 6 months and above:

cheerio - Last commit: 1 year, Open issues: 174, Github stars: 22.9K.

koa-bodyparser - Last commit: 6 months, Open issues: 9, Github stars: 1.1K.

sax-js - Last commit: 3 Years, Open issues: 65, Github stars: 941.

draftjs-to-html - Last commit: 1 Year, Open issues: 27, Github stars: 233.


I picked Node-html-parser because it seems quiet fast and very active at this moment.

(*) Openbase adds much more information regarding each library like the number of contributors (with +3 commits), weekly downloads, Monthly commits, Version etc'.

(**) The table above is a snapshot according to the specific time and date - I would check the reference again and as a first step check the level of recent activity and then dive into the smaller details.