Create a searchable PDF with C++ and Tesseract

Many office machines create a PDF as the result of a scan instead of an image. Unfortunately, these PDFs do not always include a text layer for copy and paste, or they include a text layer based on the scanner's default language rather than the language of the document. In such cases you can use Tesseract to create a "searchable PDF".

Example code


#include <cstdio>

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        printf("Program usage:\n\t %s image_filename\n", argv[0]);
        return 0;
    }
    const char* image = argv[1];
    const char* lang = "eng";         // OCR language; should match the document
    const char* outputbase = "test";  // the renderer writes test.pdf
    bool textonly = false;            // true: write only the invisible text layer

    // suppress leptonica error messages
    setMsgSeverity(L_SEVERITY_NONE);

    auto* api = new tesseract::TessBaseAPI();
    // suppress tesseract debug messages
    api->SetVariable("debug_file", "/dev/null");
    // Init returns 0 on success; it fails e.g. when the language data is missing
    if (api->Init(nullptr, lang) != 0) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        delete api;
        return 1;
    }

    api->SetOutputName(outputbase);
    auto* renderer = new tesseract::TessPDFRenderer(outputbase,
        api->GetDatapath(), textonly);
    // run OCR on every page of the input and feed the results to the renderer
    bool succeed = api->ProcessPages(image, nullptr, 0, renderer);
    api->End();
    delete renderer;
    delete api;
    if (!succeed) {
        fprintf(stderr, "Error during processing.\n");
        return 1;
    } else {
        fprintf(stderr, "PDF creation was successful.\n");
        return 0;
    }
}
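
The example above hardcodes the language, the output name and the textonly flag. Since the whole point is to OCR with the language of the document, it may be handy to read these from the command line. The following is only a minimal sketch using the standard library; the OcrOptions struct, the parse_options function and the option names (-l, -o, --textonly) are my own illustrative choices and not part of the Tesseract API.

```cpp
#include <string>

// Hypothetical option holder; names are illustrative, not part of tesseract.
struct OcrOptions {
    std::string image;                // input image path (required)
    std::string lang = "eng";         // tesseract language code, e.g. "deu"
    std::string outputbase = "test";  // output name without the .pdf extension
    bool textonly = false;            // true: create only the text layer
    bool valid = false;               // set to true when parsing succeeded
};

// Parses: image_filename [-l lang] [-o outputbase] [--textonly]
OcrOptions parse_options(int argc, const char* const argv[]) {
    OcrOptions opt;
    for (int i = 1; i < argc; ++i) {
        const std::string arg = argv[i];
        if (arg == "-l" || arg == "-o") {
            if (i + 1 >= argc) return opt;  // missing value: valid stays false
            (arg == "-l" ? opt.lang : opt.outputbase) = argv[++i];
        } else if (arg == "--textonly") {
            opt.textonly = true;
        } else if (opt.image.empty()) {
            opt.image = arg;                // first free argument is the image
        } else {
            return opt;                     // unexpected extra argument
        }
    }
    opt.valid = !opt.image.empty();
    return opt;
}
```

In main you could then call parse_options(argc, argv), print the usage text when valid is false, and pass opt.lang.c_str() to Init and opt.textonly to the TessPDFRenderer constructor.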

Build and run


> "c:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat" x64
> SET TESSDATA_PREFIX=f:\Project\tessdata
> SET PATH=%PATH%;f:\win64\bin
> cl pdf_tesseract.cpp /std:c++17 -D_CRT_SECURE_NO_WARNINGS -D_CRT_NONSTDC_NO_DEPRECATE /If:\win64\include /link /LIBPATH:F:/WIN64/LIB tesseract41.lib leptonica-1.81.0.lib  /machine:x64
> pdf_tesseract speccoll.png

Remarks

  1. Tesseract needs an image as its input file, so you cannot use a PDF (Portable Document Format) as input. Have a look at the previous blog post on how to convert a PDF to images.
  2. Tesseract uses a glyphless font, so the text is not visible. You cannot replace it with another font in Tesseract. Nevertheless, the text can be selected and copied.
  3. Tesseract's philosophy is not to modify the input image, e.g. it does not change the compression format of the image to make the PDF smaller. So if you would like to use e.g. a PNG file as input for OCR (because it has no compression artifacts that negatively affect the OCR engine), you can set textonly to true. Tesseract will then create a PDF with only the text layer, and you can use other tools to embed e.g. a JPG image in it.
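
Regarding the first remark: if somebody passes a PDF by mistake, the program above only reports a generic processing error. A small check before calling ProcessPages can give a clearer message. The sketch below relies on the fact that a PDF file starts with the signature "%PDF-"; the helper name looks_like_pdf is hypothetical, and the check deliberately ignores unusual files with junk bytes before the header.

```cpp
#include <cstring>
#include <fstream>
#include <string>

// Returns true if the file starts with the PDF signature "%PDF-".
// Hypothetical helper, not part of tesseract or leptonica.
bool looks_like_pdf(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    char header[5] = {};
    in.read(header, sizeof header);
    // gcount() is 0 if the file could not be opened, so missing files yield false
    return in.gcount() == 5 && std::memcmp(header, "%PDF-", 5) == 0;
}
```

In main you could call looks_like_pdf(image) right after reading argv[1] and exit with a hint to convert the PDF to images first.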

Example of tesseract searchable pdf

Example of tesseract searchable pdf with option textonly = true
