How to view a color diff of text from two PDFs
I recently had to look at two different legal documents (PDFs) that were mostly the same, but I wanted to spot any differences.
My first attempt was to install the cross-platform GUI tool DiffPDF. This mostly worked but was annoying because it seemed to diff only within a page, so if a paragraph got bumped to another page in the new version of the document, you couldn’t really spot the changes.
Then I installed pdftotext (it comes with poppler) with MacPorts: “sudo port install poppler”. From there you can convert each PDF to a text file and diff the two text files, like normal.
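The convert-then-diff step could look roughly like the sketch below. The function name pdf_line_diff and the use of mktemp for the intermediate files are assumptions for illustration, not something from my original workflow.

```shell
# pdf_line_diff OLD.pdf NEW.pdf   (hypothetical helper name)
# Converts both PDFs to temporary text files and runs a plain line diff.
pdf_line_diff() {
  old_txt=$(mktemp) || return 1
  new_txt=$(mktemp) || return 1
  # "pdftotext INPUT.pdf OUTPUT.txt" writes the extracted text to OUTPUT.txt.
  pdftotext "$1" "$old_txt"
  pdftotext "$2" "$new_txt"
  # -u prints unified output; diff exits with status 1 when the files differ.
  diff -u "$old_txt" "$new_txt"
  status=$?
  rm -f "$old_txt" "$new_txt"
  return "$status"
}
```

Usage would be something like: pdf_line_diff old.pdf new.pdf | less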
However, the format was still not ideal in the standard FileMerge, “diff” or “vimdiff” programs (even with ignoring whitespace) because it would show an entire paragraph changing even if just one word did. It would also show differences in all the places where text wrapping happened on different words.
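To see the wrapping problem in isolation, here is a minimal sketch using only diff and tr on two made-up snippets (the sample sentences are invented for illustration). Splitting the text into one word per line is a stand-in technique, not the tool I ended up using, but it shows why a word-level comparison helps.

```shell
# Two made-up snippets: same sentence, wrapped differently, one word changed.
old=$(mktemp)
new=$(mktemp)
printf 'The seller shall deliver\nthe goods within ten days.\n' > "$old"
printf 'The seller shall\ndeliver the goods within\nfive days.\n' > "$new"

# Line-based diff: every line is flagged because the wrapping moved.
# (diff exits non-zero when inputs differ, hence the "|| true".)
line_changes=$(diff "$old" "$new" | grep -c '^[<>]' || true)

# One word per line takes the wrapping out of the comparison.
tr -s '[:space:]' '\n' < "$old" > "$old.words"
tr -s '[:space:]' '\n' < "$new" > "$new.words"
word_changes=$(diff "$old.words" "$new.words" | grep -c '^[<>]' || true)

echo "flagged lines: $line_changes, flagged words: $word_changes"
rm -f "$old" "$new" "$old.words" "$new.words"
```

The line diff flags all five lines, while the word-per-line diff flags only the two words that actually differ ("ten" and "five").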
So finally I found a couple of other useful tools: wdiff (for word-by-word rather than line-by-line comparisons) and colordiff to colorize the output. Installed with: “sudo port install wdiff colordiff”. I also threw in a trick (shell process substitution) to avoid needing the .txt temp files:
$ wdiff <(pdftotext old.pdf -) <(pdftotext new.pdf -) | colordiff
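If you run this often, the one-liner can be wrapped in a small shell function. This is just a sketch: the name pdfwdiff is made up, and it assumes bash or zsh, since process substitution is not available in plain sh.

```shell
# pdfwdiff OLD.pdf NEW.pdf   (hypothetical helper name)
pdfwdiff() {
  if [ "$#" -ne 2 ]; then
    echo "usage: pdfwdiff OLD.pdf NEW.pdf" >&2
    return 2
  fi
  # "-" as the output file tells pdftotext to write the text to stdout.
  # <(...) is process substitution: wdiff receives a readable file name
  # connected to that output, so no .txt files are left on disk.
  wdiff <(pdftotext "$1" -) <(pdftotext "$2" -) | colordiff
}
```

Usage would be something like: pdfwdiff old.pdf new.pdf | less -R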
Beautiful! Now I can easily spot the specific words, nicely colored, that have changed.
Only thing I would improve is for it to print only the lines with changes (plus 3 context lines above/below), rather than all lines.
FIl said,
March 18, 2015 @ 8:04 pm
Hi! It’s pretty easy to only show lines with changes — just pipe into grep with the -C (context) flag for lines containing the string indicative of changes, something like this:
wdiff <(pdftotext TCS-Seller-Guide-01-08-2014.pdf -) <(pdftotext TCS-Seller-Guide-08-05-2014.pdf -) |\
colordiff |\
grep -C 3 -e '\[-' -e '{+'
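wdiff marks deleted words as [-...-] and inserted words as {+...+}, so a filter that should catch every change needs to look for both markers. Here is a minimal sketch on made-up sample text (the lines are invented, and the context is set to 1 instead of 3 to keep the output short):

```shell
# Made-up text in wdiff's output style: one replacement and one pure insertion.
sample=$(mktemp)
cat > "$sample" <<'EOF'
line one unchanged
line two unchanged
payment due in [-30-] {+45+} days
line four unchanged
line five unchanged
line six unchanged
line seven unchanged
a {+new+} clause appears here
line nine unchanged
EOF

# Keep lines containing either marker, plus one line of context on each side.
filtered=$(grep -C 1 -e '\[-' -e '{+' "$sample")
printf '%s\n' "$filtered"
rm -f "$sample"
```

This prints lines two to four and lines seven to nine, with grep's "--" separator between the two groups. Matching only the deletion marker would drop the second group, because that line contains an insertion but no deletion.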