Software Freedom Law Center

pdfdiff, a semi-generic PDF diffing and merging utility

Rationale

Many people who send us documents are either clueless, nefarious, or both. Despite a world of FLOSS document collaboration tools, people who don't "get" FLOSS choose to collaborate on documents by emailing PDFs around. While some believe it is useful to circulate PDFs to encourage feedback in a format other than document edits, many are still frustrated by this practice, especially if the PDF is not accompanied by a diff or some other mechanism that shows the changes from previous versions.

Even if you have previous versions of the document, the new PDF cannot be easily compared to them. If you want to make changes, you have to import it into some other software and attempt to get it formatted properly. This is a pointless hassle when, of course, 99.999% of the time the document was never edited as a PDF (some other, saner software was used, except perhaps when the software used to generate the PDF was Microsoft Word).

Sometimes, the sender doesn't realize the inconvenience caused by distributing the document as a PDF. Other times, they may not want you to easily figure out what changed, or they may want you to acquiesce to their control of the document. Either way, in these circumstances we shouldn't be beholden to either the clueless or the nefarious. pdfdiff tries to get around this problem.

What It Does

pdfdiff does whatever it can to extract the text from two PDFs, show you the differences, and help you merge the versions if you want to. The basic process for detecting the text content works like this:

  • See if the PDF actually wraps some reasonable embedded text and extract it.
  • If no embedded text is found:
    • Attempt various OCR mechanisms to extract text from the images.
    • If running interactively, ask the user for help orienting the pages.
  • Use various heuristic mechanisms to try to format the text in a reasonable manner.
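
The fallback order above can be sketched roughly as follows. This is a minimal Python sketch, not pdfdiff's actual code: the names extract_text, embedded_text and looks_reasonable and the 20-word threshold are hypothetical, and the OCR back ends are passed in as plain callables rather than wired to any particular OCR program. It also simplifies by accepting the first OCR result that looks usable. Only the pdftotext invocation refers to a real tool.

```python
import subprocess


def embedded_text(pdf_path):
    """Run poppler-utils' pdftotext and return the text it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def looks_reasonable(text, min_words=20):
    """Hypothetical check: the text is usable if it has enough words."""
    return len(text.split()) >= min_words


def extract_text(pdf_path, primary=embedded_text, ocr_fallbacks=(), min_words=20):
    """Return (source, text), where source is 'embedded', 'ocr' or 'none'."""
    try:
        text = primary(pdf_path)
    except (OSError, subprocess.CalledProcessError):
        text = ""  # extractor missing or failed: fall through to OCR
    if looks_reasonable(text, min_words):
        return "embedded", text

    # No usable embedded text: try each OCR back end in turn.
    for ocr in ocr_fallbacks:
        try:
            text = ocr(pdf_path)
        except (OSError, subprocess.CalledProcessError):
            continue
        if looks_reasonable(text, min_words):
            return "ocr", text
    return "none", ""
```

Passing the extractors in as arguments keeps the control flow separate from the external programs, so the fallback logic can be exercised without any PDF tools installed.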

Once text is available for both PDFs, pdfdiff runs a standard diff, a wdiff, meld, or another diff/merge operation, depending on user preferences.
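
To illustrate the difference between a line-oriented diff and a word-oriented one, here is a small sketch using Python's standard difflib module. It stands in for the external diff and wdiff programs rather than calling them. The function names line_diff and word_diff are made up for this example, and the [-...-] {+...+} markers only imitate wdiff's default output style.

```python
import difflib


def line_diff(old, new):
    """Unified diff of two texts, one list entry per output line."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old", tofile="new", lineterm="",
    ))


def word_diff(old, new):
    """Word-level diff: deletions as [-...-], insertions as {+...+}."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i2 > i1:  # words present only in the old text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # words present only in the new text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


print(word_diff("the quick brown fox", "the slow brown fox"))
# prints: the [-quick-] {+slow+} brown fox
```

A word-level diff tends to be the more useful of the two for text recovered from PDFs, since line breaks in extracted text often shift between versions even when the wording has not changed.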

What It Uses

The embedded text extraction is done by poppler-utils' pdftotext. (I prefer poppler-utils to xpdf because it has a more community-oriented development process from what I've read.)

The OCR systems used are:

  • tesseract
  • gocr
  • ocrad

Once OCR data is extracted, various heuristic techniques are used to guess at which one gave the best output. This is primarily done by the Lingua::Ident and Lingua::Identify Perl modules.
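
The idea of picking the most plausible OCR result can be sketched as below. Note that this does not use Lingua::Ident or Lingua::Identify: as a stand-in, it scores each candidate by the fraction of its tokens that are very common English words, on the assumption that badly garbled OCR output contains few of them. The word list and the names language_score and best_ocr_output are invented for this example.

```python
import re

# Tiny, hypothetical list of common English words used as a crude signal.
COMMON_ENGLISH = frozenset(
    "the of and to in a is that for it as was with be by on not this are or".split()
)


def language_score(text):
    """Fraction of alphabetic tokens that are common English words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in COMMON_ENGLISH)
    return hits / len(tokens)


def best_ocr_output(candidates):
    """Given {engine_name: text}, return (engine_name, text) with the top score.

    Raises ValueError if candidates is empty.
    """
    name = max(candidates, key=lambda n: language_score(candidates[n]))
    return name, candidates[name]
```

For example, given {"gocr": "t#e f0x 1s 1n t#e b0x", "tesseract": "the fox is in the box"}, the second candidate scores higher and is returned. A real language identifier would be far more robust than this word count, particularly for non-English documents.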

Various Perl modules are used to try to format the text in a reasonable manner after all this. That's primarily done by Damian Conway's Text::Autoformat.
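
As a rough illustration of this kind of cleanup, the sketch below re-wraps each paragraph to a fixed width using Python's standard textwrap module. It is only a stand-in for Text::Autoformat, which does considerably more. The function name reflow and the 72-column default are assumptions made for this example.

```python
import re
import textwrap


def reflow(text, width=72):
    """Collapse whitespace inside each paragraph and re-wrap it to width."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = [
        textwrap.fill(" ".join(paragraph.split()), width=width)
        for paragraph in paragraphs
        if paragraph.strip()
    ]
    return "\n\n".join(wrapped)
```

Normalizing both documents the same way before comparing them means that differences in line breaks alone no longer show up as changes in a line-based diff.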

Development

pdfdiff is part of the Loblaw project, so development is done via its development resources for now.

Other Ideas

These are merely ideas for where this program might be taken. Please don't assume these are commitments by any developer of the project to take on or complete this work.

univdiff

It would be interesting to expand the utility from pdfdiff to something called univdiff, which would take any two file formats and attempt to normalize them down to the lowest common denominator and merge them.

On the backend, this could put out any number of formats for final editing. For example, it could simply output a change-tracked ODF file, or a LaTeX file with markup.

pandoc might be a helpful utility to make this happen.

Other PDF Programs Worth Mentioning

  • There is another program called pdfdiff, although it is a pretty simple wrapper around poppler-utils.
  • pdftk, a program to do various PDF manipulations and form-fill-ins.
  • flpsed, a more generic PostScript file editor for inserting text.
  • Poppler, the PDF manipulation library and associated CLI programs (a fork of XPDF).
  • XPDF
