Preparing Windows for Tesseract "Makefile training" (LSTM training)

Tesseract Makefile training (LSTM training from existing images) was developed on Linux (a Unix-like system) and relies on the usual Unix tools.

Fortunately, many of these tools can also be installed on Windows.


Here are the suggested steps:

1. Install Tesseract

The easiest way is to use the Tesseract installer for Windows from the University of Mannheim, which also includes the (cross-compiled) training tools: https://digi.bib.uni-mannheim.de/tesseract
Use the latest version!

2. Install Python

Use an actively supported version of Python 3. If you plan to use Python for other projects, it might make sense to avoid the latest version (currently 3.11), as some packages (e.g. PyTorch, TensorFlow) might not yet be adapted to it. I would suggest using 3.10.x (for Tesseract training, 3.11.x is fine as far as I know).

3. Install Git for Windows

Git for Windows provides not only Git but also many other Unix programs that are used during Makefile training, such as find, unzip and rm. Even if you select the installer option that adds Git to your PATH, you will still need to add one extra directory to your PATH: C:\Program Files\Git\usr\bin (for a 64-bit installation).

Unfortunately, several Windows tools share a name with a Linux tool (`find`, `sort`) but behave differently, so they must be avoided during training: put the Git directory at the beginning of the PATH variable so that its versions are found first.


You can also do this temporarily in cmd (it only affects commands run in that cmd session) with:
  set PATH=C:\Program Files\Git\usr\bin;%PATH%
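
To check whether the ordering is right, a small Python sketch like the following could help. It is only an illustration: the function names are made up for this post, and the Windows system directory C:\Windows\System32 is an assumption (it is where the Windows `find` and `sort` normally live).

```python
import ntpath

# Directory from step 3; adjust if Git is installed somewhere else.
GIT_USR_BIN = r"C:\Program Files\Git\usr\bin"
# Assumed location of the Windows find.exe / sort.exe.
WINDOWS_SYSTEM_DIR = r"C:\Windows\System32"


def first_index(path_value, directory, sep=";"):
    """Return the position of `directory` in a PATH-style string, or None."""
    wanted = ntpath.normcase(ntpath.normpath(directory))
    for position, entry in enumerate(path_value.split(sep)):
        entry = entry.strip().strip('"')
        if entry and ntpath.normcase(ntpath.normpath(entry)) == wanted:
            return position
    return None


def git_tools_take_priority(path_value,
                            git_dir=GIT_USR_BIN,
                            system_dir=WINDOWS_SYSTEM_DIR):
    """True if the Git tools directory is on PATH before the Windows one."""
    git_pos = first_index(path_value, git_dir)
    system_pos = first_index(path_value, system_dir)
    if git_pos is None:
        return False
    return system_pos is None or git_pos < system_pos
```

Calling git_tools_take_priority(os.environ["PATH"]) (after import os) in the same cmd session should return True once the set PATH command above has been run.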

4. Install winget

The Windows Package Manager (winget) is also a great tool for installing and updating many Windows programs from the command line (there is also a graphical user interface for it: WingetUI). After installation, run:
  winget install GnuWin32.Make
  winget install wget
This installs the make and wget tools needed for training.
I highly recommend installing and using Windows Terminal as an alternative to Windows cmd:
  winget install Microsoft.WindowsTerminal
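
After steps 1 to 4 it can be worth checking that the tools are actually reachable from the command line. The following Python sketch uses shutil.which for that; the list of tool names is taken from the steps above, and the function name is hypothetical.

```python
import shutil

# Tools mentioned in steps 1 to 4 of this post.
REQUIRED_TOOLS = ("tesseract", "python", "git", "find", "unzip", "rm",
                  "make", "wget")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools that cannot be found on the current PATH.

    `which` is injectable so the check can be tried without touching the
    real PATH; by default it is shutil.which.
    """
    return [name for name in tools if which(name) is None]
```

If print(missing_tools()) shows a non-empty list, the install directory of the listed tool is probably not on your PATH yet and has to be added manually.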

5. Install the bc and dc calculator for Windows

bc is a basic calculator (often referred to as a bench calculator), described as "an arbitrary-precision calculator language". The Windows version can be downloaded from embedeo.org, e.g.:

  wget https://embedeo.org/ws/command_line/bc_dc_calculator_windows/bc-1.07.1-win32-embedeo-02.zip

You only need bc.exe, so you can extract it to a directory on your PATH, e.g. (in my case):

  unzip -j bc-1.07.1-win32-embedeo-02.zip "bc-1.07.1-win32-embedeo-02/bin/bc.exe" -d "c:\Program Files\Tools"
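
If unzip is not available yet, roughly the same extraction can be done with Python's standard zipfile module. This is a minimal sketch, not part of the original workflow; extract_flat is a name invented here, and it mimics the -j option by dropping the directory part of the archive member.

```python
import os
import shutil
import zipfile


def extract_flat(zip_path, member, dest_dir):
    """Extract one archive member into dest_dir without its folder path."""
    os.makedirs(dest_dir, exist_ok=True)
    # Archive member names always use forward slashes.
    target = os.path.join(dest_dir, os.path.basename(member))
    with zipfile.ZipFile(zip_path) as archive:
        # archive.open raises KeyError if the member does not exist.
        with archive.open(member) as source, open(target, "wb") as output:
            shutil.copyfileobj(source, output)
    return target
```

For the archive above the call would be extract_flat("bc-1.07.1-win32-embedeo-02.zip", "bc-1.07.1-win32-embedeo-02/bin/bc.exe", r"c:\Program Files\Tools"). Writing to Program Files may require an elevated prompt.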


6. Install tesstrain and test training


  set PATH=C:\Program Files\Git\usr\bin;%PATH%
  git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
  cd tesstrain
  make tesseract-langdata
  mkdir tessdata_best
  wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
  unzip ocrd-testset.zip -d data/ocrd-ground-truth
  make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=10000
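
The last command can also be started from Python, which makes it easy to put the Git tools first on PATH for that one process only. The following is a sketch under assumptions: the helper names are invented for this post, the Git path is the one from step 3, and the working directory is assumed to be the cloned tesstrain folder.

```python
import os
import subprocess

# Directory from step 3; adjust if Git is installed somewhere else.
GIT_USR_BIN = r"C:\Program Files\Git\usr\bin"


def training_env(base_env=None, git_dir=GIT_USR_BIN, sep=";"):
    """Copy the environment and put the Git tools directory first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    env["PATH"] = git_dir + sep + env.get("PATH", "")
    return env


def training_command(model_name="ocrd", tessdata="tessdata_best",
                     max_iterations=10000):
    """Build the same make call as in the command list above."""
    return ["make", "training",
            f"MODEL_NAME={model_name}",
            f"TESSDATA={tessdata}",
            f"MAX_ITERATIONS={max_iterations}"]


def run_training(workdir="tesstrain", **options):
    """Run make inside the tesstrain checkout; raises if make fails."""
    subprocess.run(training_command(**options), cwd=workdir,
                   env=training_env(), check=True)
```

Calling run_training() from the folder that contains the tesstrain checkout then corresponds to the last make command above, without changing PATH for the rest of the session.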
  

