PDF to image/multipage (in Python)

Many people love the PDF format because such files are readable on all common devices and provide a stable representation of a document. Scanners have also started to support it, but with one catch: they just wrap the scanned images in a PDF without running OCR on them, so such PDFs are not searchable.

Because Tesseract OCR expects an image as input (PDF is a document format, like ODT, RTF or DOCX), the user first needs to convert the PDF to an image. A simple solution is to extract the images from the PDF. We will use https://health.mo.gov/lab/pdf/PublicWaterMassMailing.pdf as an example.

There are plenty of online tools, but you can also use a tool from the Poppler or Xpdf project: pdfimages. E.g.:

pdfimages.exe PublicWaterMassMailing.pdf pwmm

Another popular solution is to use Ghostscript (which is also bundled with the virtual PDF printer PDFCreator). It is useful if you are not sure whether the PDF contains text or split images and you would like to keep the original layout of the pages:

gs -dNOPAUSE -r300 -sDEVICE=tiffscaled24 -sCompression=lzw -dBATCH -sOutputFile=pwmm.tif PublicWaterMassMailing.pdf

(On Windows use gswin32c, e.g. from "c:\Program Files\PDFCreator\Ghostscript\bin\gswin32c.exe".)

But can we do it in Python instead of on the command line? Yes, we can, with the PyMuPDF package. By the way, it is based on MuPDF, a lightweight PDF, XPS, and e-book viewer, created by the same company as Ghostscript: Artifex Software, Inc.

So here is the code (pdf_to_tiff.py):

import fitz
from PIL import Image


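def get_PIL_image(pix):
    """Convert fitz pix to PIL.Image."""
    # The pixmap holds RGB samples (no alpha channel by default)
    img = Image.frombytes("RGB",
                          (pix.width, pix.height),
                          pix.samples)
    return img


def binarize_image(image, thresh=100):
    """Convert image to 1-bit; pixels brighter than thresh become white."""
    gray = image.convert("L")
    binarized = gray.point(lambda p: 255 if p > thresh else 0, mode="1")
    return binarized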


input_pdf = "PublicWaterMassMailing.pdf"
output_name = "pwmm.tif"
# Pillow TIFF compression names include "tiff_deflate", "tiff_lzw", "group4"
compression = "group4"  # requires binarized image

zoom = 2  # scale factor; the default rendering is 72 dpi, so 2 gives 144 dpi
mat = fitz.Matrix(zoom, zoom)

doc = fitz.open(input_pdf)
image_list = []
for page in doc:
    pix = page.get_pixmap(matrix=mat)  # older PyMuPDF versions call this getPixmap()
    pil_image = get_PIL_image(pix)
    if not pil_image:
        continue
    binarized = binarize_image(pil_image)
    image_list.append(binarized)

if image_list:
    image_list[0].save(
        output_name,
        save_all=True,
        append_images=image_list[1:],
        compression=compression,
        dpi=(72 * zoom, 72 * zoom),  # resolution actually rendered (72 dpi base times zoom)
    )
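
A note on resolution: PyMuPDF renders a page at 72 dpi when no matrix is given, so the zoom factor and the target resolution are related by zoom = dpi / 72. A minimal sketch of that arithmetic follows; zoom_for_dpi is a hypothetical helper name, not part of PyMuPDF.

```python
PDF_BASE_DPI = 72  # PDF user space has 72 units per inch


def zoom_for_dpi(dpi):
    """Return the zoom factor that renders a page at the requested dpi."""
    return dpi / PDF_BASE_DPI


# Example (assumes fitz is imported as in the script above):
#   zoom = zoom_for_dpi(300)
#   mat = fitz.Matrix(zoom, zoom)
```

With a zoom computed this way, the dpi value written to the TIFF can simply be the dpi you asked for.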

If you need higher-quality output, try to play with thresh in binarize_image, or use another TIFF compression and skip binarization. In one project I was able to get better quality by opening the image as grayscale and then binarizing it.
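
The advice above could be sketched roughly like this: binarize only when the chosen compression needs a 1-bit image, otherwise keep the grayscale data. This is only a sketch under assumptions, not the code used in the project mentioned: prepare_page and save_multipage_tiff are hypothetical names, the threshold of 100 is just the default used earlier, and the compression names are the ones Pillow uses ("tiff_lzw", "group4").

```python
from PIL import Image


def prepare_page(image, compression, thresh=100):
    """Return a page image suited to the chosen TIFF compression."""
    gray = image.convert("L")
    if compression == "group4":
        # Group 4 stores only 1-bit images, so apply a hard threshold:
        # pixels brighter than thresh become white, the rest black.
        return gray.point(lambda p: 255 if p > thresh else 0, mode="1")
    # Other compressions can keep the full grayscale range.
    return gray


def save_multipage_tiff(images, output_name, compression="tiff_lzw", thresh=100):
    """Save a list of PIL images as one multipage TIFF."""
    pages = [prepare_page(img, compression, thresh) for img in images]
    if not pages:
        return
    pages[0].save(
        output_name,
        save_all=True,
        append_images=pages[1:],
        compression=compression,
    )
```

A lower thresh turns more pixels white (thinner strokes) and a higher one turns more pixels black (bolder strokes), so it is worth trying a few values on a sample page before running OCR.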
