Make PDFs searchable

I was looking for a good way to make scanned documents searchable. Most newer scanning software already has OCR built in, but what about all the old documents? Using pdfsandwich and Tesseract, we recover the text from each page of a PDF and place it behind the page image as an invisible layer. That way, we can search the PDF with a normal PDF reader or upload it to Google Translate to get a translated version. To get a text-only version, pdftotext can be used.

First, install the missing packages (tested on Ubuntu 12.04):

# we use tesseract-ocr-deu for German
apt-get install tesseract-ocr tesseract-ocr-deu poppler-utils
apt-get install exactimage imagemagick ghostscript

Second, we download and install pdfsandwich:

wget http://downloads.sourceforge.net/project/pdfsandwich/pdfsandwich%200.0.7/\
pdfsandwich_0.0.7_amd64.deb
dpkg -i pdfsandwich_0.0.7_amd64.deb
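
Before running the OCR step, it can help to confirm that the command-line tools are actually on the PATH. The following is only a sketch: missing_tools is a hypothetical helper name, not part of any of the packages above, and the list of tool names in the usage note is my assumption about which binaries the packages provide.

```shell
#!/bin/sh
# missing_tools TOOL...: print the name of every TOOL not found on the PATH,
# one per line. Hypothetical helper for illustration only.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools pdfsandwich tesseract pdftotext hocr2pdf convert gs` prints nothing if everything is installed.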

Finally, we run pdfsandwich and pdftotext on a PDF:

pdfsandwich -resolution 240x240 -rgb -lang deu german_document.pdf
# creates german_document_ocr.pdf with colors and 240dpi

pdftotext german_document_ocr.pdf
# gives german_document_ocr.txt
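
A quick way to check whether the OCR step produced a usable text layer is to let pdftotext write to standard output (by passing - as the output file) and test whether anything other than whitespace comes out. This is a minimal sketch; has_text_layer is a hypothetical helper name, and it only tells us that some text exists, not that it was recognized correctly.

```shell
#!/bin/sh
# has_text_layer FILE: succeed if pdftotext extracts any non-whitespace
# text from FILE. Hypothetical helper for illustration only.
has_text_layer() {
    # "-" makes pdftotext write the text to stdout instead of a .txt file
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

With this in place, `has_text_layer german_document_ocr.pdf && echo searchable` would report whether the file has a text layer.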

To process all PDFs in the current directory and its subdirectories, find can be used:

# skip files that are already pdfsandwich output
find . -name "*.pdf" ! -name "*_ocr.pdf" -exec pdfsandwich -resolution 240x240 -rgb -lang deu {} \;
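
Running the batch a second time would OCR every input again. The loop below is a rough sketch of one way to avoid that for a single directory: it skips any input that already has an _ocr.pdf counterpart. The function name ocr_all is hypothetical, and the sketch assumes that pdfsandwich writes its output next to the input file under the default name shown above.

```shell
#!/bin/sh
# ocr_all [DIR]: run pdfsandwich on every PDF directly inside DIR
# (default: current directory) that has no *_ocr.pdf counterpart yet.
# Hypothetical helper; assumes the output lands next to the input.
ocr_all() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue                     # glob matched nothing
        case "$pdf" in *_ocr.pdf) continue ;; esac    # skip earlier outputs
        [ -e "${pdf%.pdf}_ocr.pdf" ] && continue      # already processed
        pdfsandwich -resolution 240x240 -rgb -lang deu "$pdf"
    done
}
```

Calling `ocr_all .` processes the current directory; unlike the find command, it does not descend into subdirectories.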
