Create a searchable PDF with C++ and Tesseract
Many office machines produce a PDF rather than an image as the result of a scan. Unfortunately, these PDFs do not always include a text layer for copy and paste, or the text layer is based on the scanner's default language rather than the language of the document. In such cases you can use Tesseract to create a "searchable PDF".
Example code
#include <cstdio>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        printf("Program usage:\n\t %s image_filename\n", argv[0]);
        return 0;
    }
    const char* image = argv[1];
    const char* lang = "eng";
    const char* outputbase = "test";  // the renderer writes test.pdf
    bool textonly = false;            // true: create a PDF with the text layer only

    // suppress leptonica error messages
    setMsgSeverity(L_SEVERITY_NONE);

    auto* api = new tesseract::TessBaseAPI();
    // suppress tesseract debug messages
    api->SetVariable("debug_file", "/dev/null");
    // Init() returns 0 on success
    if (api->Init(nullptr, lang) != 0) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        delete api;
        return 1;
    }
    api->SetOutputName(outputbase);

    auto* renderer = new tesseract::TessPDFRenderer(outputbase,
                                                    api->GetDatapath(), textonly);
    bool succeed = api->ProcessPages(image, nullptr, 0, renderer);
    api->End();
    delete renderer;  // closes the output file
    delete api;

    if (!succeed) {
        fprintf(stderr, "Error during processing.\n");
        return 1;
    }
    fprintf(stderr, "PDF creation was successful.\n");
    return 0;
}
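The example hardcodes the output base ("test") and the language ("eng"). As a minimal sketch, the two values could be derived instead: the output base from the image file name, and the language string by joining several language codes with "+", which is the form Tesseract accepts for loading more than one language (the matching traineddata files have to be present in tessdata). The helper names output_base_for and join_languages are hypothetical and not part of the Tesseract API.

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Hypothetical helper: "speccoll.png" -> "speccoll".
// The PDF renderer appends ".pdf" to the output base on its own.
std::string output_base_for(const std::string& image_path) {
    std::filesystem::path p(image_path);
    p.replace_extension();  // drop the image extension, keep any directory part
    return p.string();
}

// Hypothetical helper: {"eng", "deu"} -> "eng+deu",
// a language string that can be passed to TessBaseAPI::Init().
std::string join_languages(const std::vector<std::string>& langs) {
    std::string joined;
    for (const auto& code : langs) {
        if (!joined.empty()) joined += '+';
        joined += code;
    }
    return joined;
}
```

With these, the example could pass output_base_for(argv[1]).c_str() as outputbase and join_languages(...).c_str() as lang, as long as the returned strings stay alive while the API uses them.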
Build and run
> "c:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat" x64
> SET TESSDATA_PREFIX=f:\Project\tessdata
> SET PATH=%PATH%;f:\win64\bin
> cl pdf_tesseract.cpp /std:c++17 -D_CRT_SECURE_NO_WARNINGS -D_CRT_NONSTDC_NO_DEPRECATE /If:\win64\include /link /LIBPATH:F:/WIN64/LIB tesseract41.lib leptonica-1.81.0.lib /machine:x64
> pdf_tesseract speccoll.png
Remarks
- Tesseract needs an image as its input file, so you cannot use a PDF (Portable Document Format) file as input. Have a look at the previous blog post on how to convert PDF to images.
- Tesseract uses a glyphless font, so the text is not visible. You cannot replace it with another font in Tesseract. Even so, the text can be selected and copied.
- Tesseract's philosophy is not to modify the input image, e.g. it does not change the image's compression format to make the PDF smaller. So if you would like to use e.g. a PNG file as input for OCR (it has no compression artifacts, which negatively affect the OCR engine), you can set textonly to true. Tesseract will then create a PDF with only the text layer, and you can use other tools to embed e.g. a JPG image into it.
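Two of the remarks above can be handled before Tesseract is called at all. The following is only a sketch under stated assumptions: the "--textonly" switch and the Options struct are hypothetical names for this example, and the PDF check relies only on the fact that PDF files start with the signature "%PDF-".

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical options; the fields mirror the variables in the example code.
struct Options {
    bool textonly = false;
    std::string image;
};

// Parses the arguments after argv[0]. "--textonly" is an assumed flag name
// for this sketch; any other argument is taken as the image file name.
Options parse_options(const std::vector<std::string>& args) {
    Options opts;
    for (const auto& arg : args) {
        if (arg == "--textonly") {
            opts.textonly = true;
        } else {
            opts.image = arg;
        }
    }
    return opts;
}

// Returns true if the file starts with the PDF signature "%PDF-".
// Such a file has to be converted to images before OCR.
bool looks_like_pdf(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    char magic[5] = {};
    in.read(magic, sizeof magic);
    return in.gcount() == 5 && std::string(magic, 5) == "%PDF-";
}
```

In main, one could build the argument vector from argv + 1 to argv + argc, report an error when looks_like_pdf(opts.image) is true, and pass opts.textonly to the TessPDFRenderer constructor in place of the hardcoded value.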
Example of tesseract searchable pdf
Example of tesseract searchable pdf with option textonly = true