Tesseract LSTM training (aka Makefile training) on Raspberry Pi

"Makefile training" is an example of training from existing data (a set of images and their ground-truth files).

This tutorial does not cover how to create and prepare training data, although that step is crucial for good OCR results.
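Even with existing data, it can be worth checking that every line image has a transcription before a long training run. The sketch below assumes tesstrain's naming convention (an image `name.tif` or `name.png`, optionally `name.bin.png` or `name.nrm.png`, next to `name.gt.txt`). `missing_ground_truth` is a hypothetical helper name, and `data/foo-ground-truth` is the directory used later in this tutorial.

```shell
# Hypothetical helper: list line images that have no matching transcription.
# Assumption: images are *.tif or *.png and transcriptions are *.gt.txt.
missing_ground_truth() {
  dir=$1
  for img in "$dir"/*.tif "$dir"/*.png; do
    [ -e "$img" ] || continue      # skip glob patterns that matched nothing
    base=${img%.*}                 # strip the final extension (.tif or .png)
    base=${base%.bin}              # also handle name.bin.png
    base=${base%.nrm}              # also handle name.nrm.png
    [ -f "$base.gt.txt" ] || printf '%s\n' "$img"
  done
}

# Only run the check if the directory already exists.
if [ -d data/foo-ground-truth ]; then
  missing_ground_truth data/foo-ground-truth
fi
```

No output means every image found has a matching `.gt.txt` file.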

Install general tools

sudo apt update
sudo apt install make wget bash unzip bc python3 byobu git
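To confirm that the tools really are on the PATH before going further, a small check like the following may help. This is only a sketch: `missing_tools` is a hypothetical helper name, and `git` is included in the list because the tesstrain checkout later in this tutorial needs it.

```shell
# Hypothetical helper: print every given command name that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools from the install step, plus git for the tesstrain checkout.
missing=$(missing_tools make wget bash unzip bc python3 byobu git)
if [ -n "$missing" ]; then
  echo "Still missing:" $missing
else
  echo "All tools found."
fi
```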

Note: byobu (a text-based window manager and terminal multiplexer) is not strictly required for training, but it is very useful when you start the training from a remote computer.
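Because a dropped SSH connection would otherwise end a long-running job, it can be useful to check that the current shell is inside a multiplexer session before starting. The sketch below relies on the fact that tmux exports `TMUX` and GNU screen exports `STY`; it assumes byobu is running on one of these two backends. `in_multiplexer` is a hypothetical helper name.

```shell
# Return success if the shell appears to run inside tmux or GNU screen.
# tmux exports TMUX, screen exports STY.
in_multiplexer() {
  [ -n "${TMUX:-}" ] || [ -n "${STY:-}" ]
}

if in_multiplexer; then
  echo "Inside a multiplexer session; the job should survive a disconnect."
else
  echo "Not inside a multiplexer session; consider starting byobu first."
fi
```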

Install the latest Tesseract on Raspberry Pi

If you use a Debian-based OS (e.g. Raspbian), the default repositories do not provide a current Tesseract version (because of Debian's strict update policy). However, you can use the notesalexp repository (for Debian and Ubuntu) to get the latest stable version:

sudo apt install apt-transport-https
sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak-$(date +%Y%m%d)
echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" | sudo tee -a /etc/apt/sources.list
sudo apt-get update -oAcquire::AllowInsecureRepositories=true
sudo apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
sudo apt-get update

sudo apt-get install tesseract-ocr
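After the installation, it may be worth confirming which version actually ended up on the PATH. The sketch below parses the first line of `tesseract --version`. `tesseract_major` is a hypothetical helper name, and the threshold of 5 is an assumption based on the `tesseract-ocr5` repository added above.

```shell
# Hypothetical helper: extract the major version from the first line of
# `tesseract --version` output, e.g. "tesseract 5.3.0" -> 5.
tesseract_major() {
  printf '%s\n' "$1" | sed -n '1s/^tesseract v\{0,1\}\([0-9][0-9]*\)\..*/\1/p'
}

if command -v tesseract >/dev/null 2>&1; then
  # Some older builds print the version to stderr, hence 2>&1.
  major=$(tesseract_major "$(tesseract --version 2>&1)")
  if [ "${major:-0}" -ge 5 ]; then
    echo "Tesseract $major.x found."
  else
    echo "Tesseract is older than 5.x (or its version could not be parsed)."
  fi
else
  echo "tesseract is not on the PATH."
fi
```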

Install the training tool and the required langdata

git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
cd tesstrain
make tesseract-langdata

Use the example data from tesstrain and run the training

unzip -qq -d data/foo-ground-truth ocrd-testset.zip
byobu
make training 2>&1 | tee training.log

and wait... If you started the training in a byobu session, you can close the terminal or connection and return later. On a Raspberry Pi 4 with an SSD, this example training should finish after about 1.5 hours with output like this:


2 Percent improvement time=982, best error was 2.78 @ 4928
At iteration 5910/9300/9300, Mean rms=0.454000%, delta=0.231000%, BCER train=0.754000%, BWER train=2.948000%, skip ratio=0.000000%,  New best BCER = 0.754000 wrote checkpoint.

At iteration 5962/9400/9400, Mean rms=0.483000%, delta=0.290000%, BCER train=0.950000%, BWER train=3.551000%, skip ratio=0.000000%,  New worst BCER = 0.950000 wrote checkpoint.

At iteration 5983/9500/9500, Mean rms=0.498000%, delta=0.321000%, BCER train=1.015000%, BWER train=3.811000%, skip ratio=0.000000%,  New worst BCER = 1.015000 wrote checkpoint.

At iteration 6032/9600/9600, Mean rms=0.535000%, delta=0.379000%, BCER train=1.213000%, BWER train=4.431000%, skip ratio=0.000000%,  New worst BCER = 1.213000 wrote checkpoint.

At iteration 6080/9700/9700, Mean rms=0.586000%, delta=0.461000%, BCER train=1.504000%, BWER train=5.366000%, skip ratio=0.000000%,  New worst BCER = 1.504000 wrote checkpoint.

At iteration 6109/9800/9800, Mean rms=0.607000%, delta=0.490000%, BCER train=1.592000%, BWER train=5.748000%, skip ratio=0.000000%,  New worst BCER = 1.592000 wrote checkpoint.

At iteration 6132/9900/9900, Mean rms=0.623000%, delta=0.517000%, BCER train=1.668000%, BWER train=6.021000%, skip ratio=0.000000%,  New worst BCER = 1.668000 wrote checkpoint.

At iteration 6164/10000/10000, Mean rms=0.646000%, delta=0.552000%, BCER train=1.795000%, BWER train=6.461000%, skip ratio=0.000000%,  New worst BCER = 1.795000 wrote checkpoint.

Finished! Selected model with minimal training error rate (BCER) = 0.754
lstmtraining \
--stop_training \
--continue_from data/foo/checkpoints/foo_checkpoint \
--traineddata data/foo/foo.traineddata \
--model_output data/foo.traineddata
Loaded file data/foo/checkpoints/foo_checkpoint, unpacking...
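When you return to a running or finished training, you may not want to scroll through the whole log. The following sketch pulls the most recent "New best BCER" value out of `training.log` (the file written by the `tee` command above). `latest_best_bcer` is a hypothetical helper name, and the pattern assumes log lines shaped like the ones shown here.

```shell
# Hypothetical helper: print the last "New best BCER" value found in a log file.
latest_best_bcer() {
  sed -n 's/.*New best BCER = \([0-9.]*\).*/\1/p' "$1" | tail -n 1
}

# Only report if the log from the training run exists.
if [ -f training.log ]; then
  echo "Best BCER so far: $(latest_best_bcer training.log)"
fi
```

According to the log above, the final model is written to data/foo.traineddata. It could then be tried with something like `tesseract line.tif stdout --tessdata-dir data -l foo`, where `line.tif` is a placeholder for one of your own line images.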

Note

This tutorial does not imply that the Raspberry Pi is optimal or even sufficient hardware for Tesseract training. It only shows that Tesseract training is also possible on such hardware, for example on a device that is running 24/7 anyway.
