Create a searchable PDF with C++ and tesseract

- May 01, 2021

Many office machines create a PDF as the result of a scan instead of an image. Unfortunately, they do not always include a text layer for copy and paste, or they include a text layer based on the scanner's default language rather than the document's language. In such cases you can use tesseract to create a "searchable PDF".

Example code


#include <cstdio>  // printf, fprintf

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>

int main(int argc,char* argv[]) {
    bool textonly = false;
    const char *lang = "eng";
    const char *outputbase;

    if(argc==1) {
        printf("Program usage:\n\t %s image_filename\n",argv[0]);
        return 0;
    }
    const char *image = argv[1];
    outputbase = "test";

    // suppress leptonica error messages
    setMsgSeverity(L_SEVERITY_NONE);

    auto *api = new tesseract::TessBaseAPI();
    // suppress tesseract debug messages
    api->SetVariable("debug_file", "/dev/null");
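    // Init returns 0 on success, e.g. it fails when the language data is missing
    if (api->Init(nullptr, lang) != 0) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        delete api;
        return 1;
    }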

    api->SetOutputName(outputbase);
    auto* renderer = new tesseract::TessPDFRenderer(outputbase,
        api->GetDatapath(), textonly);
    bool succeed = api->ProcessPages(image, nullptr, 0, renderer);
    api->End();
    delete renderer;
    delete api;
    if (!succeed) {
        fprintf(stderr, "Error during processing.\n");
        return 1;
    } else {
        fprintf(stderr, "PDF creation was successful.\n");
        return 0;
    }
}
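
The example above hard-codes the language ("eng") and the output name ("test"). Since the motivation is a text layer in the document's own language, it may be handy to take these from the command line. Below is a minimal sketch using only the standard library. The names OcrOptions, parseOptions and stripExtension, as well as the -l and --textonly switches, are my own assumptions for illustration and not part of the tesseract API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical options holder; none of these names come from tesseract.
struct OcrOptions {
    std::string image;         // input image file
    std::string lang = "eng";  // tesseract language code, e.g. "eng" or "deu"
    std::string outputbase;    // output name without extension
    bool textonly = false;     // true: create PDF with text layer only
    bool valid = false;        // false: caller should print usage and exit
};

// Removes the last extension of the file name part: "scan.png" -> "scan".
std::string stripExtension(const std::string& path) {
    const std::size_t sep = path.find_last_of("/\\");
    const std::size_t start = (sep == std::string::npos) ? 0 : sep + 1;
    const std::size_t dot = path.find_last_of('.');
    // No dot, dot inside a directory name, or a leading dot: keep as is.
    if (dot == std::string::npos || dot <= start) return path;
    return path.substr(0, dot);
}

// Assumed argument layout: image_filename [-l lang] [--textonly]
OcrOptions parseOptions(const std::vector<std::string>& args) {
    OcrOptions opt;
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] == "-l") {
            if (i + 1 >= args.size()) return opt;  // missing language value
            opt.lang = args[++i];
        } else if (args[i] == "--textonly") {
            opt.textonly = true;
        } else if (opt.image.empty()) {
            opt.image = args[i];
        } else {
            return opt;  // unexpected extra argument
        }
    }
    if (opt.image.empty()) return opt;  // no image given
    opt.outputbase = stripExtension(opt.image);
    opt.valid = true;
    return opt;
}
```

In main you could build the vector with std::vector<std::string>(argv + 1, argv + argc), then pass opt.lang.c_str() to api->Init and opt.outputbase.c_str() together with opt.textonly to the TessPDFRenderer constructor. The renderer adds the .pdf extension itself, so scan.png would end up as scan.pdf.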

Build and run


> "c:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat" x64

> SET TESSDATA_PREFIX=f:\Project\tessdata
> SET PATH=%PATH%;f:\win64\bin

> cl pdf_tesseract.cpp /std:c++17 -D_CRT_SECURE_NO_WARNINGS -D_CRT_NONSTDC_NO_DEPRECATE /If:\win64\include /link /LIBPATH:F:/WIN64/LIB tesseract41.lib leptonica-1.81.0.lib  /machine:x64
> pdf_tesseract speccoll.png

Remarks

Tesseract needs an image as its input file, so you cannot use a PDF (Portable Document Format) file as input. Have a look at the previous blog post on how to convert a PDF to images.
Tesseract uses a glyphless font, so the text is not visible. You cannot replace it with another font in tesseract. Still, the text can be selected and copied.
Tesseract's philosophy is not to modify the input image: for example, it does not change the image's compression format to make the PDF smaller. So if you would like to use e.g. a PNG file as OCR input (because it has no compression artifacts, which negatively affect the OCR engine), you can set textonly to true. Tesseract will then create a PDF with only the text layer, and you can use other tools to embed e.g. a JPG image in it.
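
Because a PDF is not accepted as input, it can be useful to catch that case before calling tesseract and print a clearer message. The sketch below checks whether a file begins with the PDF signature "%PDF-". The helper name looksLikePdf is hypothetical, and the check assumes the signature sits at the very start of the file.

```cpp
#include <cstring>
#include <fstream>
#include <string>

// Returns true if the file starts with the PDF signature "%PDF-".
// Returns false for other content and for files that cannot be read.
bool looksLikePdf(const std::string& filename) {
    std::ifstream in(filename, std::ios::binary);
    char header[5] = {};
    in.read(header, sizeof header);
    return in.gcount() == 5 && std::memcmp(header, "%PDF-", 5) == 0;
}
```

In the example program you could call looksLikePdf(image) right after reading argv[1] and exit with a hint to convert the PDF to images first.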

Example of tesseract searchable pdf

Example of tesseract searchable pdf with option textonly = true
