Preparing Windows for Tesseract "Makefile training" (LSTM training)
Tesseract's Makefile training (LSTM training from existing images) was developed on Linux, a Unix-like system, and relies on the usual Unix tools.
Fortunately, many of these tools can also be installed on Windows.
Here are the suggested steps:
1. Install Tesseract
The easiest way is to use the tesseract installer for Windows from
the University of Mannheim, which also contains (cross) compiled training
tools:
https://digi.bib.uni-mannheim.de/tesseract
Use the latest version!
2. Install Python
Use an actively supported Python 3 release. If you plan to
use Python for other projects, it might make sense to avoid the latest
release (currently 3.11), as some packages might not yet be adapted to it
(e.g. PyTorch, TensorFlow). I would suggest using 3.10.x (for
Tesseract training, 3.11.x is fine as far as I know).
3. Install Git for Windows
Git for Windows (Git SCM) provides not only Git but also many of the Unix programs
used during Makefile training, such as `find`, `unzip` and `rm`. Even if you select
the installer option that adds Git to your PATH, make sure that one extra directory is also on your PATH:
C:\Program Files\Git\usr\bin
(for a 64-bit installation).
Unfortunately, several Windows tools share a name with a Unix tool (`find`, `sort`)
but behave differently, so they must be avoided during training: put the Git directory
at the beginning of the PATH variable. You can also do this temporarily in `cmd`
(it only affects commands run in that cmd session) with:
set PATH=C:\Program Files\Git\usr\bin;%PATH%
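To see which executable a name like `find` actually resolves to, a small Python check can help. This is only a sketch: the helper names `prepend_to_path` and `report_resolution` are made up for illustration, and the Git directory is assumed to be the default 64-bit location mentioned above. Changing `os.environ` affects only the running Python process and the programs it starts, much like `set PATH` affects only the current cmd session.

```python
import os
import shutil

# Assumption: default location of the Unix tools in a 64-bit Git for Windows install.
GIT_USR_BIN = r"C:\Program Files\Git\usr\bin"


def prepend_to_path(directory, path_value, sep=os.pathsep):
    """Return path_value with directory placed first and any later copy removed."""
    # Windows paths are case-insensitive, so compare in lower case.
    parts = [p for p in path_value.split(sep) if p and p.lower() != directory.lower()]
    return sep.join([directory] + parts)


def report_resolution(tools=("find", "sort")):
    """Print the full path that each tool name currently resolves to."""
    for tool in tools:
        print(f"{tool}: {shutil.which(tool)}")


if __name__ == "__main__":
    os.environ["PATH"] = prepend_to_path(GIT_USR_BIN, os.environ.get("PATH", ""))
    report_resolution()
```

If the printed paths point into the Git directory rather than `C:\Windows\System32`, the Unix versions will be used.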
4. Install winget
The Windows Package Manager (winget) is also a great tool for installing and updating many Windows programs from
the command line (there is also a graphical user interface for it: WingetUI). After installation, run:
winget install GnuWin32.Make
winget install wget
This installs the `make` and `wget` tools needed for training. I highly recommend installing and using Windows Terminal as an alternative to Windows cmd:
winget install Microsoft.WindowsTerminal
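Before moving on, it can be useful to confirm that the tools installed so far are visible on the PATH. The following is a minimal sketch using Python's `shutil.which`; the function name `missing_tools` and the tool list are assumptions for illustration (`lstmtraining` is one of the training tools that come with Tesseract). It only reports whether a name is found, not whether it is the Unix or the Windows variant, and `tesseract` and `lstmtraining` will only be found if the Tesseract installation directory is on the PATH.

```python
import shutil

# Assumption: tool names collected from steps 1 to 4 of this guide.
REQUIRED_TOOLS = ("tesseract", "lstmtraining", "git", "make", "wget", "unzip", "find")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tool names that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Other tools can be checked the same way by passing their names, for example `missing_tools(("bc",))`.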
5. Install the bc and dc calculator on Windows
bc, the basic calculator (often referred to as the bench calculator), is "an arbitrary-precision calculator language". The Windows version can be downloaded from embedeo.org, e.g.:
wget https://embedeo.org/ws/command_line/bc_dc_calculator_windows/bc-1.07.1-win32-embedeo-02.zip
You only need bc.exe, so you can extract it to a directory on your PATH, e.g. (in my case):
unzip -j bc-1.07.1-win32-embedeo-02.zip "bc-1.07.1-win32-embedeo-02/bin/bc.exe" -d "c:\Program Files\Tools"
6. Install tesstrain and test training
set PATH=C:\Program Files\Git\usr\bin;%PATH%
git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
cd tesstrain
make tesseract-langdata
mkdir tessdata_best
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
unzip ocrd-testset.zip -d data/ocrd-ground-truth
make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=10000
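If training stops early, one thing worth checking is the extracted ground truth. tesstrain expects each line image to have a transcription with the same base name and the extension `.gt.txt`. A hedged sketch of such a check follows: the function names are hypothetical, the list of image extensions is taken from the tesstrain README and may change between versions, and the recursive search is an assumption.

```python
from pathlib import Path

# Assumption: image extensions accepted by tesstrain; longer suffixes are listed first.
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def transcription_for(image):
    """Return the expected .gt.txt path for an image, or None if it is not an image."""
    name = image.name
    for suffix in IMAGE_SUFFIXES:
        if name.endswith(suffix):
            return image.with_name(name[: -len(suffix)] + ".gt.txt")
    return None


def images_without_transcription(ground_truth_dir):
    """List image files below ground_truth_dir that have no matching .gt.txt file."""
    root = Path(ground_truth_dir)
    if not root.is_dir():
        return []
    missing = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        expected = transcription_for(path)
        if expected is not None and not expected.exists():
            missing.append(path)
    return missing


if __name__ == "__main__":
    for image in images_without_transcription("data/ocrd-ground-truth"):
        print("No .gt.txt for", image)
```

Run from the tesstrain directory, the script prints nothing when every image has its transcription.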