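Anyone have a diff algorithm for rendered HTML?

I'm interested in seeing a good diff algorithm, possibly in JavaScript, for rendering a side-by-side diff of two HTML pages. The idea is that the diff would show the differences in the rendered HTML.

To clarify, I want to be able to see the side-by-side diff as rendered output. So if I delete a paragraph, the side-by-side view would know to space things correctly.

@Josh Exactly. Though maybe it would show the deleted text in red or something. The idea is that if I use a WYSIWYG editor for my HTML content, I don't want to have to switch to the HTML source to do diffs. I'd like to do it with two WYSIWYG editors side by side, maybe. Or at least display the diffs side by side in an end-user-friendly way.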

I believe a good way to do this is to render the HTML to an image and then use some diff tool that can compare images to spot the differences.

So, you expect

<font face="Arial">Hi Mom</font>

and

<span style="font-family:Arial;">Hi Mom</span>

to be considered the same?

The output depends very much on the user agent. As Ionut Anghelcovici suggests, make an image. Do one for every browser you care about.

There's another nice trick you can use to significantly improve the look of a rendered HTML diff. It doesn't fully solve the original problem, but it makes the result much easier to read.

When two versions are rendered side by side, it is very difficult to get matching content to line up vertically, and vertical alignment is crucial for comparing side-by-side diffs. To improve it, you can insert invisible HTML elements in each version of the diff at "checkpoints" where the two sides should be vertically aligned. Then you can use a bit of client-side JavaScript to add vertical spacing around each checkpoint until the sides line up.

Explained in a little more detail:

If you want to use this technique, run your diff algorithm and insert a bunch of visibility:hidden <span>s or tiny <div>s wherever your side-by-side versions should match up, according to the diff. Then run JavaScript that finds each checkpoint (and its side-by-side neighbor) and adds vertical spacing to the checkpoint that is higher-up (shallower) on the page. Now your rendered HTML diff will be vertically aligned up to that checkpoint, and you can continue repairing vertical alignment down the rest of your side-by-side page.
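The spacing step described above can be sketched as a small pure function. This is only an illustration, not the answerer's actual code: the name `computeSpacers` and its inputs are hypothetical, and it assumes that spacing added above a checkpoint pushes everything below it down by exactly that amount.

```javascript
// Hypothetical helper: given the measured top offsets (in pixels) of matching
// checkpoints in the left and right panes, compute how much extra vertical
// spacing to add above each checkpoint so that both sides line up.
// Assumption: spacing added above a checkpoint shifts all later checkpoints
// in the same pane down by the same amount.
function computeSpacers(leftTops, rightTops) {
  if (leftTops.length !== rightTops.length) {
    throw new Error("every checkpoint needs a partner on the other side");
  }
  const left = [];
  const right = [];
  let leftShift = 0;  // total spacing already added to the left pane
  let rightShift = 0; // total spacing already added to the right pane
  for (let i = 0; i < leftTops.length; i++) {
    // Positions after applying the spacing decided for earlier checkpoints.
    const l = leftTops[i] + leftShift;
    const r = rightTops[i] + rightShift;
    // Pad whichever side is higher up on the page.
    const addLeft = r > l ? r - l : 0;
    const addRight = l > r ? l - r : 0;
    left.push(addLeft);
    right.push(addRight);
    leftShift += addLeft;
    rightShift += addRight;
  }
  return { left, right };
}
```

In a browser, the offsets could be read with `getBoundingClientRect().top` on each checkpoint element before any spacing is applied, and the returned values could then be set as each checkpoint's top margin. How the checkpoints are marked up and located is left open here.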

If it is XHTML (which assumes a lot on my part) would the Xml Diff Patch Toolkit help? http://msdn.microsoft.com/en-us/library/aa302294.aspx

For smaller differences you might be able to do a normal text-diff, and then analyse the missing or inserted pieces to see how to resolve it, but for any larger differences you're going to have a very tough time doing this.
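For reference, a "normal text-diff" on small inputs can be sketched as a word-level diff based on the longest common subsequence. This is a minimal illustration, not a recommendation of a particular tool: the function name `diffWords` and the output shape are made up for the example, and it knows nothing about HTML tags, which is exactly why it struggles with larger markup changes.

```javascript
// Minimal word-level diff using a longest-common-subsequence (LCS) table.
// Returns a list of { type: "equal" | "delete" | "insert", text } operations.
function diffWords(oldText, newText) {
  const a = oldText.split(/\s+/).filter(Boolean);
  const b = newText.split(/\s+/).filter(Boolean);

  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table from the start, emitting one operation per word.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: "equal", text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ type: "delete", text: a[i] });
      i++;
    } else {
      ops.push({ type: "insert", text: b[j] });
      j++;
    }
  }
  while (i < a.length) ops.push({ type: "delete", text: a[i++] });
  while (j < b.length) ops.push({ type: "insert", text: b[j++] });
  return ops;
}
```

The table takes time and memory proportional to the product of the two word counts, so this only suits short fragments.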

For instance, how would you detect, and show, that a left-aligned image (floating left of a paragraph of text) has suddenly become right-aligned?

I ended up needing something similar a while back. To get the HTML to line up side by side, you could use two iframes, but you'd then have to tie their scrolling together via JavaScript as you scroll (if you allow scrolling).
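Tying the scrolling together might look roughly like the sketch below. It is an assumption-laden illustration rather than the code I used: `linkScrolling` is a hypothetical helper, and it expects two scrollable elements (for example two `overflow: auto` containers) that expose `scrollTop`, `scrollLeft` and `addEventListener`. With real iframes you would have to attach to each frame's own document instead, which only works for same-origin content.

```javascript
// Hypothetical helper: mirror the scroll position of each pane onto the other.
// Assumes both panes expose scrollTop, scrollLeft and addEventListener, as
// scrollable DOM elements do.
function linkScrolling(paneA, paneB) {
  function mirror(source, target) {
    source.addEventListener("scroll", () => {
      // Writing a position the target already has does not move it, so the
      // two handlers settle instead of triggering each other forever.
      target.scrollTop = source.scrollTop;
      target.scrollLeft = source.scrollLeft;
    });
  }
  mirror(paneA, paneB);
  mirror(paneB, paneA);
}
```

This copies absolute offsets, so it only behaves well when both sides have roughly the same height.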

To see the diff, however, you will more than likely want to use someone else's library. I used DaisyDiff, a Java library, for a similar project where my client was happy with seeing a single HTML rendering of the content with MS Word "track changes"-like markup.

HTH

Using a text differ will break on non-trivial documents. Depending on what you think is intuitive, XML differs will probably generate diffs that aren't very good for text with markup. AFAIK, DaisyDiff is the only library specialized in HTML. It works great for a subset of HTML.

If you were working with Java and XHTML, XMLUnit allows you to compare two XML documents via the org.custommonkey.xmlunit.DetailedDiff class:

Compares and describes all the differences between two XML documents. The document comparison does not stop once the first unrecoverable difference is found, unlike the Diff class.

Consider using the output of links or lynx to render a text-only version of the html, and then diff that.

Use the markup mode of Pretty Diff for HTML. It is written entirely in JavaScript.

http://prettydiff.com/

What about DaisyDiff (Java and PHP versions available)?

The following features are really nice:

  • Works with badly formed HTML that can be found "in the wild".
  • The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
  • In addition to the default visual diff, HTML source can be diffed coherently.
  • Provides easy to understand descriptions of the changes.
  • The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.

Over the weekend I posted a new project on CodePlex that implements an HTML diff algorithm in C#. The original algorithm was written in Ruby. I understand you were looking for a JavaScript implementation, but perhaps having one available in C# with source code could help you port the algorithm. Here is the link if you are interested: htmldiff.codeplex.com. You can read more about it here.

UPDATE: This library has been moved to GitHub.