Tool to compare large numbers of PDF files?

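I need to compare a large number of PDF files by their visual (rendered) content. Because the PDF files were created on different platforms and with different versions of the software, there are structural differences. For example:

  • the chunking of text can be different
  • the write order can be different
  • positions can differ by a few pixels

The tool should compare the content the way a human would, not the internal structure. I want to test for regressions between different versions of the PDF generator that we use.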


I think your best approach would be to convert the PDFs to images at a decent resolution and then do an image compare.

To generate images from PDF you can use Adobe PDF Library or the solution suggested at Best way to convert pdf files to tiff files.

To compare the generated TIFF files, I found that GNU tiffcmp (on Windows, part of GnuWin32 tiff) and tiffinfo did a good job. Use tiffcmp -l and count the lines of output to find any differences. If you can tolerate a small amount of change (e.g. anti-aliasing differences), use tiffinfo to work out the total number of pixels; you can then generate a percentage difference value.
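The percentage calculation is plain arithmetic and might be sketched as follows. The function name and parameters are hypothetical. It assumes you have already counted the lines printed by tiffcmp -l and read the image width and height from tiffinfo, and that each output line corresponds to one differing sample (which is worth checking against your own tiffcmp output).

```python
def percent_different(differing_samples, width, height, samples_per_pixel=1):
    """Return the share of differing samples as a percentage.

    differing_samples: number of lines printed by `tiffcmp -l` (assumed to be
                       one line per differing sample).
    width, height:     image dimensions in pixels, as reported by tiffinfo.
    samples_per_pixel: 1 for grayscale, 3 for RGB (assumption about the TIFFs).
    """
    total = width * height * samples_per_pixel
    if total <= 0:
        raise ValueError("image dimensions must be positive")
    return 100.0 * differing_samples / total
```

A regression test could then fail only when the value exceeds a threshold you choose, so that anti-aliasing noise does not trigger false alarms.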

By the way, for anyone doing simple PDF comparison where the structure hasn't changed, it is possible to use command-line diff and ignore certain patterns, e.g. with GNU diff 2.7:

diff --brief -I xap: -I xapMM: -I /CreationDate -I /BaseFont -I /ID --binary --text

This still has the problem that it doesn't always catch changes in generated font names.
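For batch runs, the same idea can be approximated in a few lines of Python. This is a rough sketch, not an exact equivalent of diff -I: it simply drops every line that matches one of the patterns before comparing, whereas diff -I ignores only hunks whose changed lines all match. The function names are hypothetical; the patterns are the ones from the command above.

```python
import re

# Patterns copied from the diff command; matching lines are dropped entirely.
IGNORE = [rb"xap:", rb"xapMM:", rb"/CreationDate", rb"/BaseFont", rb"/ID"]


def significant_lines(data, patterns=IGNORE):
    """Return the lines of a PDF byte string that match none of the patterns."""
    regexes = [re.compile(p) for p in patterns]
    return [line for line in data.splitlines()
            if not any(r.search(line) for r in regexes)]


def same_ignoring(a, b, patterns=IGNORE):
    """True if two PDF byte strings are equal once ignorable lines are removed."""
    return significant_lines(a, patterns) == significant_lines(b, patterns)
```

Read each file with open(path, "rb").read() and pass the two byte strings to same_ignoring. Like the diff command, this only helps when the generator produces structurally identical output.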

I've used a home-baked script which

  • converts all pages on two PDFs to bitmaps
  • colors pages of PDF 1 to red-on-white
  • changes white to transparent on pages of PDF 2
  • overlays each page from PDF 2 on top of the corresponding page from PDF 1
  • runs conversion/coloring and overlaying in parallel on multiple cores
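To make the coloring and overlay steps concrete, here is the per-pixel logic written out in plain Python. This is only an illustration with hypothetical names; the script itself delegates this work to ImageMagick, and real pages would be image files rather than nested lists of RGB tuples.

```python
BLACK = (0, 0, 0)
WHITE = (255, 255, 255)
RED = (255, 0, 0)


def overlay_pixel(back, front):
    """Combine one pixel of PDF 1 (back) with the same pixel of PDF 2 (front)."""
    back = RED if back == BLACK else back      # PDF 1: black becomes red
    return back if front == WHITE else front   # PDF 2: white is transparent


def overlay_page(back_page, front_page):
    """Overlay two equally sized pages given as rows of RGB tuples."""
    return [[overlay_pixel(b, f) for b, f in zip(back_row, front_row)]
            for back_row, front_row in zip(back_page, front_page)]
```

Red survives only where PDF 1 has ink that PDF 2 does not cover, which is why small layout shifts show up as red fringes around the text.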

Software used:

  • GhostScript for PDF-to-bitmap conversion
  • ImageMagick for coloring, transparency and overlay
  • inotify for synchronizing parallel processes
  • any PNG-capable image viewer for reviewing the result

Pros:

  • simple implementation
  • all tools used are open source
  • great for finding small differences in layout

Cons:

  • the conversion is slow
  • major differences between PDFs (e.g. pagination) result in a mess
  • bitmaps are not zoomable
  • only works well for black-and-white text and diagrams
  • no easy-to-use GUI

I've been looking for a tool which would do the same on PDF/PostScript level.

Here's how our script invokes the utilities (note that ImageMagick uses GhostScript behind the scenes to do the PDF->PNG conversion):

$ convert -density 150x150 -fill red -opaque black +antialias 1.pdf back%02d.png
$ convert -density 150x150 -transparent white +antialias 2.pdf front%02d.png
$ composite front01.png back01.png result01.png # do this for all pairs of images
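The "do this for all pairs" step, run on several cores, could look roughly like this in Python. It is a sketch under assumptions: the helper names are hypothetical, the file names follow the frontNN.png/backNN.png pattern used above, and ImageMagick's composite must be on the PATH. Pass whichever page numbers convert actually produced.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def composite_command(page):
    """Build the composite call for one page, using the file names above."""
    return ["composite", f"front{page:02d}.png", f"back{page:02d}.png",
            f"result{page:02d}.png"]


def overlay_all(pages, workers=4):
    """Run one composite process per page, several at a time."""
    def run(page):
        # check=True raises CalledProcessError if composite fails.
        subprocess.run(composite_command(page), check=True)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces evaluation so that any failure is raised here.
        list(pool.map(run, pages))
```

Threads are sufficient here because the real work happens in the external composite processes.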

We've also used pdftotext (see Sklivvz's answer) to generate ASCII versions of PDFs and wdiff to compare them.

Use pdftotext's -layout switch to enhance readability and get some idea of changes in the layout.

To get nice colored output from wdiff, use this wrapper script:

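#!/bin/bash
# Colorize wdiff output: deleted words in red, inserted words in green.
# The $'...' quoting is a bash feature, hence bash rather than sh.
RED=$'\e[1;31m'
GREEN=$'\e[1;32m'
RESET=$'\e[0m'
wdiff -w "$RED" -x "$RESET" -y "$GREEN" -z "$RESET" -n "$1" "$2"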
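Where wdiff is not available, a word-level comparison of the extracted text can be approximated with Python's standard difflib module. This is a minimal sketch, not a reimplementation of wdiff: the function name is hypothetical, it assumes the pdftotext output has already been read into two strings, and it collapses all whitespace, so layout information from -layout is lost.

```python
import difflib


def word_diff(old, new):
    """Return a wdiff-style string: [-deleted-] and {+inserted+} words."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)
```

For example, word_diff("total 10 EUR", "total 12 EUR") returns "total [-10-] {+12+} EUR".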

Because no such tool was available, we wrote one. You can download the i-net PDF content comparer and use it. I hope that helps others with the same problem. If you have problems with it or feedback for us, you can contact our support.


Bluebeam PDF software will do this for you.

You can batch compare PDF files with Tarkware Pdf Comparer, but it's not free and it requires Adobe Acrobat.

Our product, PDF Comparator (http://www.premediasystems.com/pdfc.html), will do this quite elegantly and efficiently. It's also not free, and it is a Mac OS X only application.

Based on your needs, a convert-to-text solution would be the easiest and most direct. I did think the bitmap idea was pretty cool, though.

There is actually a diffpdf tool.

http://www.qtrac.eu/diffpdf.html

Its weakness is that it doesn't react well when additions make new text shift partially to a new page. For instance, if old page 4 should be compared to the end of page 5 and the beginning of page 6, you'll need to shift parameters to compare the two slices separately.

I don't see this mentioned here, so here it is. Via Super User, "How to compare the differences between two PDF files?" (answer #229891, by @slestak), there is

https://github.com/vslavik/diff-pdf

(build steps for Ubuntu Natty can be found in get-diff-pdf.sh)

As far as I can see, it basically overlays the text/graphics of each page in the pdf(s), allowing you to easily see if there were any changes...

Cheers!