Make PDFs searchable
I searched for a good way to make scanned documents searchable. Most newer scanning software already has some OCR built in, but what about all the old documents? Using pdfsandwich and Tesseract, we recover the text from each page of a PDF and put it behind each page as an invisible layer. That way, we can search the PDF with a normal PDF reader or upload it to Google Translate to get a translated version. To get a text-only version, pdftotext can be used.
First, install the missing packages (tested on Ubuntu 12.04):

# we use tesseract-ocr-deu for German
apt-get install tesseract-ocr tesseract-ocr-deu poppler-utils
apt-get install exactimage imagemagick ghostscript

Second, we download and install pdfsandwich:

wget http://downloads.sourceforge.net/project/pdfsandwich/pdfsandwich%200.0.7/\
pdfsandwich_0.0.7_amd64.deb
dpkg -i pdfsandwich_0.0.7_amd64.deb

Finally, we run pdfsandwich and pdftotext on a PDF:

pdfsandwich -resolution 240x240 -rgb -lang deu german_document.pdf
# creates german_document_ocr.pdf with colors and 240dpi
pdftotext german_document_ocr.pdf
# gives german_document_ocr.txt

To process all PDFs in the current directory, find can be used:

find . -name "*.pdf" -exec pdfsandwich -resolution 240x240 -rgb -lang deu {} \;
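One thing to watch with the find command: if it is run a second time, it will also match the *_ocr.pdf files produced by the first run. A rough sketch of a loop that skips those could look like the following. It assumes pdfsandwich writes name_ocr.pdf next to name.pdf, as in the single-file example. The function name ocr_pending and the OCR_CMD variable are made up for this illustration and are not part of pdfsandwich.

```shell
# Run OCR on every PDF in a directory that has not been processed yet.
# ocr_pending and OCR_CMD are hypothetical names used only in this sketch.
ocr_pending() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        # The glob stays literal when no PDF exists; skip that case.
        [ -e "$f" ] || continue
        # Skip files that are themselves OCR output.
        case $f in *_ocr.pdf) continue ;; esac
        # Skip inputs whose OCR output already exists (assumed naming scheme).
        [ -e "${f%.pdf}_ocr.pdf" ] && continue
        # OCR_CMD defaults to pdfsandwich; set OCR_CMD=echo for a dry run.
        ${OCR_CMD:-pdfsandwich} -resolution 240x240 -rgb -lang deu "$f"
    done
}
```

Calling ocr_pending with no argument works on the current directory. Unlike find, this loop does not descend into subdirectories, and setting OCR_CMD=echo beforehand only prints the arguments that would be passed for each pending file.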