Description
Current Behavior
I am using tesseract to OCR simple English text. The document was scanned at 600 dpi and cleaned up so that it contains only pure black and pure white pixels. I invoke tesseract with this command:
tesseract --psm 6 page.tiff page.box makebox
I've attached images to show the types of box errors that I get. The most common pattern is a set of three boxes: two look fine, but a third overlaps half of each of the other two and is often clearly too tall. For example, one image has an "e" followed by an "n". Both letters get reasonable boxes, but a third, full-height box covers the right half of the "e" and the left half of the "n". The same thing happens with the trailing "he" in "the". It even happens for text that looks almost identical to other text which tesseract boxes correctly.
Expected Behavior
I would expect each box to fit the character it is associated with and, apart from some kerning situations, boxes not to overlap.
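To make that expectation checkable, here is a minimal Python sketch that flags pairs of boxes which overlap horizontally by a large fraction. It assumes the usual makebox line layout of "symbol left bottom right top page" with the origin at the bottom-left of the image. The names `Box`, `parse_box_lines` and `find_overlaps`, and the 25% threshold, are illustrative choices of mine and not part of tesseract.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse lines assumed to look like 'symbol left bottom right top page'."""
    boxes = []
    for line in lines:
        # Split from the right so the symbol field is kept intact.
        parts = line.rstrip("\n").rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, left, bottom, right, top, _page = parts
        try:
            boxes.append(Box(char, int(left), int(bottom), int(right), int(top)))
        except ValueError:
            continue  # skip lines whose coordinates are not integers
    return boxes


def find_overlaps(boxes: List[Box], min_fraction: float = 0.25) -> List[Tuple[Box, Box]]:
    """Return pairs whose horizontal overlap is at least min_fraction
    of the narrower box's width (threshold is an arbitrary assumption)."""
    hits = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            # Ignore pairs on different text lines: vertical ranges must intersect.
            if min(a.top, b.top) <= max(a.bottom, b.bottom):
                continue
            shared = max(0, min(a.right, b.right) - max(a.left, b.left))
            narrower = min(a.right - a.left, b.right - b.left)
            if narrower > 0 and shared / narrower >= min_fraction:
                hits.append((a, b))
    return hits
```

Feeding it the lines of the generated box file, for example `find_overlaps(parse_box_lines(open(path)))` with `path` pointing at whatever file makebox wrote, should list the suspicious pairs. The pairwise loop is quadratic, which is fine for a single page but could be narrowed to neighbouring boxes for large inputs.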
Suggested Fix
No response
tesseract -v
tesseract 5.5.0
leptonica-1.85.0
libgif 5.2.2 : libjpeg 6b (libjpeg-turbo 3.1.0) : libpng 1.6.44 : libtiff 4.7.0 : zlib 1.3.1.zlib-ng : libwebp 1.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libcurl/8.11.1 OpenSSL/3.2.6 zlib/1.3.1.zlib-ng brotli/1.1.0 libidn2/2.3.8 libpsl/0.21.5 libssh/0.11.3/openssl/zlib nghttp2/1.64.0 OpenLDAP/2.6.10
Operating System
No response
Other Operating System
Fedora Linux 42
uname -a
Linux jkl 6.16.8-200.fc42.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Sep 19 17:47:18 UTC 2025 x86_64 GNU/Linux
Compiler
No response
CPU
No response
Virtualization / Containers
No response
Other Information
No response