Software Freedom Law Center

root/trunk/apps/pdfdiff/README

Revision 63, 3.8 kB (checked in by bkuhn, 8 months ago)

r69@hughes: bkuhn | 2008-04-27 17:59:16 -0400

  • Initial version of software files
= pdfdiff, A semi-generic pdf diffing and merging utility =

== Rationale ==

Most people who send us PDFs are either clueless, nefarious, or both.  Despite
a world of FLOSS document collaboration tools, people who don't "get"
FLOSS insist on doing document collaboration and/or distribution by
emailing PDFs around.

If they've sent you previous versions, the new PDF cannot be easily
compared to the old version.  If you want to make changes, you have to
import it into some other software and attempt to get it formatted
properly.  This is such a pointless hassle when, of course, 99.999% of the
time, *they* didn't edit it as a PDF; they used some other, saner software
(except, perhaps, in the case where the software used to generate the PDF
was Microsoft Word).

Sometimes, they don't realize this is absolutely the worst way to share a
document.  Other times, they probably just don't want you to easily figure
out what changed, and want you merely to acquiesce to their control of the
document.  Either way, we shouldn't be beholden to either the clueless or the
nefarious.  pdfdiff tries to get around this problem.

== What It Does ==

pdfdiff attempts to do anything it can to extract the text from two PDFs,
show you the differences, and try to help you merge the versions if you
want to.  The basic process for detection of the text content works like
this:

  * See if the PDF actually wraps some reasonable embedded text and extract it.
  * If no embedded text is found:
      * Attempt various OCR mechanisms to extract text from the images
      * If running interactively, ask the user for help orienting the pages
  * Use various heuristic mechanisms to try to format the text in a reasonable manner.

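The detection steps above can be sketched roughly as follows.  This is only an illustration in Python, not pdfdiff's actual code: the names `extract_text`, `has_reasonable_text`, `pdftotext_embedded` and `tesseract_ocr` are hypothetical, and the 20-word threshold for "reasonable" text is an assumption.  The two wrappers shell out to `pdftotext`, `pdftoppm` and `tesseract`, which must be installed for them to work.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List, Optional

Extractor = Callable[[str], str]  # takes a PDF path, returns extracted text


def has_reasonable_text(text: str, min_words: int = 20) -> bool:
    """Assumed heuristic: the text is usable if it contains enough words."""
    return len(text.split()) >= min_words


def extract_text(pdf_path: str,
                 embedded: Extractor,
                 ocr_engines: Iterable[Extractor],
                 ask_user: Optional[Callable[[str], None]] = None) -> List[str]:
    """Return candidate texts: the embedded text if usable, else one per OCR engine."""
    text = embedded(pdf_path)
    if has_reasonable_text(text):
        return [text]
    if ask_user is not None:
        ask_user(pdf_path)  # interactive mode: let the user help orient the pages
    return [engine(pdf_path) for engine in ocr_engines]


def pdftotext_embedded(pdf_path: str) -> str:
    """Embedded-text step: 'pdftotext FILE -' writes the text to stdout."""
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def tesseract_ocr(pdf_path: str) -> str:
    """One possible OCR step: rasterize pages with pdftoppm, then OCR each image."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        subprocess.run(["pdftoppm", "-r", "300", "-png", pdf_path, prefix],
                       check=True)
        pages = []
        for image in sorted(Path(tmp).glob("page-*.png")):
            out = subprocess.run(["tesseract", str(image), "stdout"],
                                 capture_output=True, text=True, check=True)
            pages.append(out.stdout)
        return "\n".join(pages)
```

A call such as `extract_text("contract.pdf", pdftotext_embedded, [tesseract_ocr])` would return either the embedded text or a list of OCR candidates to choose among.  Passing the extractors in as callables keeps the fallback logic separate from the external tools.
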
Once text is available for both PDFs, it does a standard diff, a wdiff, a meld, or some other diffing/merging operation based on user preferences.

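As a rough illustration of the line-level and word-level comparisons, here is a minimal Python sketch built on the standard `difflib` module.  It is a stand-in for running diff or wdiff themselves, not what pdfdiff does internally; the function names and the `old.pdf`/`new.pdf` labels are hypothetical.  The word-level output borrows wdiff's default `[-deleted-]` and `{+inserted+}` markers.

```python
import difflib
from typing import List


def line_diff(old: str, new: str) -> str:
    """Unified diff of the two extracted texts, line by line."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="old.pdf", tofile="new.pdf"))


def word_diff(old: str, new: str) -> str:
    """Word-level diff using wdiff-style [-deleted-] and {+inserted+} markers."""
    a, b = old.split(), new.split()
    out: List[str] = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i2 > i1:  # words present only in the old text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # words present only in the new text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


print(word_diff("the quick brown fox", "the slow brown fox"))
# prints: the [-quick-] {+slow+} brown fox
```

Word-level output tends to be more useful than a line diff here, because text extracted from two PDFs often wraps at different places even when the wording is unchanged.
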
== What It Uses ==

The embedded text extraction is done by poppler-utils' pdftotext.  (I prefer poppler-utils to xpdf because it has a more community-oriented development process from what I've read.)

The OCR systems used are:
   * tesseract
   * gocr
   * ocrad

Once OCR data is extracted, various heuristic techniques are used to guess at which one gave the best output.  This is primarily done by the Lingua::Ident and Lingua::Identify Perl modules.

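To make the idea of guessing the best OCR output concrete, here is a small Python sketch.  It does not reproduce Lingua::Ident or Lingua::Identify; as a much cruder stand-in, it scores each candidate by the share of its tokens that are very common English words.  The word list, the function names and the sample strings are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical, deliberately tiny list of common English words.
COMMON_WORDS = frozenset({
    "the", "of", "and", "to", "a", "in", "is", "that",
    "for", "it", "with", "as", "this", "be", "on",
})


def plausibility(text: str) -> float:
    """Fraction of whitespace-separated tokens that are common English words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.strip(".,;:!?\"'()") in COMMON_WORDS)
    return hits / len(tokens)


def best_ocr_output(candidates: Dict[str, str]) -> str:
    """Return the name of the engine whose output looks most like English."""
    return max(candidates, key=lambda name: plausibility(candidates[name]))


# Made-up sample outputs, not real OCR results.
samples = {
    "tesseract": "This is the text of the agreement.",
    "gocr": "Th1s 1s tbe t3xt 0f tbe agreernent.",
}
print(best_ocr_output(samples))
# prints: tesseract
```

Garbled OCR output tends to mangle short function words, so even this crude score separates a clean result from a noisy one.  A real language-identification module would be far more robust and not limited to English.
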
Various Perl modules are used to try to format the text in a reasonable manner after all this.  That's primarily done by Damian Conway's Text::Autoformat.

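For a feel of what this reformatting step involves, the following Python sketch rewraps each paragraph to a fixed width using the standard `textwrap` module.  It is only a rough analogue: Text::Autoformat does considerably more than this.  The function name `reflow` and the default width of 72 are assumptions.

```python
import re
import textwrap


def reflow(text: str, width: int = 72) -> str:
    """Rewrap each blank-line-separated paragraph to the given width."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(
        # Collapse stray line breaks and runs of spaces before wrapping.
        textwrap.fill(" ".join(p.split()), width=width)
        for p in paragraphs
    )


print(reflow("one\ntwo   three\n\n\nfour"))
# prints:
# one two three
#
# four
```

Normalizing line breaks this way before diffing reduces spurious differences caused purely by how each PDF happened to wrap its text.
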
== Development ==

pdfdiff is part of the [wiki:WikiStart Loblaw] project, so development is done via its development resources for now.

== Other Ideas ==

These are merely ideas for where this program might be taken.  Please don't assume these are commitments by any developer of the project to take on or complete this work.

=== univdiff ===

It would be interesting to expand the utility from pdfdiff to something called univdiff, which would take any two file formats and attempt to normalize them down to the lowest common denominator and merge them.

On the backend, this could put out any number of formats for final editing.  For example, it could simply output a change-tracked ODF file, or a LaTeX file with markup.

[http://johnmacfarlane.net/pandoc/ pandoc] might be a helpful utility to make this happen.

== Other PDF Programs Worth Mentioning ==

 * There is [http://people.inf.ethz.ch/cremersc/misc/pdfdiff.html another program called pdfdiff], although it is a pretty simple wrapper to poppler-utils.
 * [http://www.accesspdf.com/pdftk/ pdftk], a program to do various PDF manipulations and form-fill-ins.
 * [http://www.ecademix.com/JohannesHofmann/flpsed.html flpsed], a more generic Postscript file editor for inserting text.
 * [http://poppler.freedesktop.org/ Poppler], the PDF manipulation library and associated CLI programs (a fork of XPDF).
 * [http://www.foolabs.com/xpdf/ XPDF]