May 20, 2017

Text-only PDF compression, assembly tooling

In 2006 the renowned Adam Langley released a jbig2 encoder, later revealed to be in use for "Google Books". The blog post about it seems to have been removed since; the only trace I found is here. Its downsides for archiving small-print characters are real, as the Xerox incident in 2013 showed, but it remains a very effective compression format for bi-level documents. I follow the instructions from the repo for the jbig2 options and pdf.py usage, and add my own sorting and page-size options for the final output. Here is an example for a scanned and reassembled newspaper article:

./jbig2 -s -p -v *.png && ./pdf.py output > output.pdf

gs \
 -sOutputFile=seite-txt.pdf \
 -sDEVICE=pdfwrite \
 -dDEVICEWIDTHPOINTS=269.28 \
 -dDEVICEHEIGHTPOINTS=822 \
 -dCompatibilityLevel=1.4 \
 -dNOPAUSE \
 -dBATCH \
 -dPDFFitPage \
  output.pdf

pdftk scan-pic-0*.pdf cat output scan-pic.pdf
pdftk A=seite-txt.pdf B=scan-pic.pdf cat A1 B1 A2 B2 A3 B3 output "seitezwei.pdf"

rm -f output*
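
The hand-written "A1 B1 A2 B2 A3 B3" list gets tedious for longer articles. Here is a minimal sketch of a helper that prints that page spec for any page count. The function name interleave_pages is my own invention, not part of pdftk, and it assumes both inputs have the same number of pages:

```shell
#!/bin/sh
# Hypothetical helper (not part of pdftk): print an interleaved page
# spec for the handles A and B, e.g. "A1 B1 A2 B2" for two pages.
# Assumes both PDFs have the same number of pages.
interleave_pages() {
    n=$1
    i=1
    spec=""
    while [ "$i" -le "$n" ]; do
        spec="$spec A$i B$i"
        i=$((i + 1))
    done
    # Drop the leading space left over from the first append.
    printf '%s\n' "${spec# }"
}
```

Used as: pdftk A=seite-txt.pdf B=scan-pic.pdf cat $(interleave_pages 3) output seitezwei.pdf (unquoted on purpose, so the shell splits the spec into separate arguments). Newer pdftk versions also have a shuffle operation that collates page ranges, which may do the same job if your build has it.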

Btw, if you write in LibreOffice and ever have to send out multiple variations of a document with the same attachments, there's a --headless mode that lets you automate the conversion and feed the result into a pdftk pipeline together with other outputs. For version control beyond the LO built-in per-document functionality, there's an odt2txt git extension.

libreoffice --headless --convert-to pdf resume/cv.odt --outdir resume
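
For the "multiple variations, same attachments" case, a loop along these lines might do. It is only a sketch: the cv-*.odt naming, the attachments file and the -full.pdf suffix are assumptions for illustration, and it relies on LibreOffice writing the PDF under the input's base name into --outdir:

```shell
#!/bin/sh
# Map an .odt path to the PDF that --convert-to pdf writes when
# --outdir is the same directory as the input file.
pdf_path_for() {
    printf '%s\n' "${1%.odt}.pdf"
}

# Hypothetical layout: $1 is a directory holding cv-*.odt variations,
# $2 is one PDF with the shared attachments.
build_variants() {
    dir=$1
    attachments=$2
    for odt in "$dir"/cv-*.odt; do
        [ -e "$odt" ] || continue   # glob matched nothing
        libreoffice --headless --convert-to pdf "$odt" --outdir "$dir" || return 1
        pdf=$(pdf_path_for "$odt")
        # Append the attachments after each converted variation.
        pdftk "$pdf" "$attachments" cat output "${pdf%.pdf}-full.pdf" || return 1
    done
}
```

Called as build_variants resume resume/attachments.pdf, this would leave one resume/cv-*-full.pdf per variation next to the sources.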