TextRecognize giving corrupted results on some images - Bug CASE 4075220

Question

Perhaps the question should be: is this a Mathematica bug, or am I misunderstanding something?

I'm having trouble learning to use TextRecognize because I repeatedly get errors. When I analyse where they arise, I find that the problem is either somewhere in the picture (the text scan) or, more probably, somewhere in TextRecognize itself. Here is an example of a problem that I believe lies inside TextRecognize itself.

I will use a scanned text that TextRecognize does not seem to like, and have it analyze only the part of the image that corresponds to the 5th line of text (in other words, the third line of the main paragraph):

image = Import["https://i.stack.imgur.com/MJKf0.png"]

TextRecognize[image, "Character", {"BoundingBox", "Text"}, Masking -> Rectangle[{81, 103}, {1038, 129}]]

This gives an inconsistent result: there are 58 BoundingBox items but 61 Text items. No wonder this causes trouble later on.
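
A minimal sketch for seeing the mismatch directly. It assumes the result comes back as a list of two sublists (bounding boxes first, then the recognized strings), which is how the counts above were read. The name res is only illustrative.

```
(* res is a hypothetical name for the result of the call above *)
res = TextRecognize[image, "Character", {"BoundingBox", "Text"},
   Masking -> Rectangle[{81, 103}, {1038, 129}]];

(* count the entries in each sublist; assumes res has the form {boxes, texts} *)
Length /@ res

(* True only when every bounding box has a matching piece of text *)
SameQ @@ (Length /@ res)
```

Under that assumption, the second expression should come out False whenever the two counts differ, as with the 58 versus 61 items reported here.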

This is the result on my Pi3 with Mathematica 11.2.0.0. A similar problem occurs when trying to recognize the content of the very first row ("1."). When I let it recognize the whole area of the picture, the result is OK and there are no problems with this fifth line of text.

Mathematica's TextRecognize is very weak. I suggest you use Tesseract directly. — Alexey Popkov, Jun 03 '18 at 18:07
@Alexey Popkov - that's a pity. Can I use Tesseract through Mathematica? — CJoe, Jun 03 '18 at 18:49
@Alexey Popkov - and is this example about weakness or about an error inside TextRecognize? My feeling is that getting a different number of BoundingBox items than Text items means there is a mistake in the TextRecognize code. Am I wrong? — CJoe, Jun 03 '18 at 19:44
Mathematica's TextRecognize uses Tesseract under the hood, but as one can see it doesn't do it well... — Alexey Popkov, Jun 03 '18 at 20:01
@Alexey Popkov - yes, I know. I meant whether there is a way to use Tesseract from Mathematica "directly". — CJoe, Jun 03 '18 at 20:23
Probably there is a way, but it isn't straightforward (and is probably quite complicated). You may ask a separate question about it. Obviously the developers weren't able to do it well... — Alexey Popkov, Jun 03 '18 at 21:42
As a workaround, you can filter the regions afterwards: Select[res, RegionMember[Rectangle[{81, 103}, {1038, 129}], RegionCentroid[#[[1]]]] &]. — Batracos, Jun 04 '18 at 07:37
@Alexey Popkov - so can I close this question, since this is in fact a bug? Should I report it to Wolfram somewhere? — CJoe, Jun 04 '18 at 08:28
@CJoe I wouldn't tag the question as a bug unless tech support confirms that. And I recommend reporting to support@wolfram.com. — Alexey Popkov, Jun 04 '18 at 08:51
I reported it to support@wolfram.com and on 8th June they confirmed it under CASE 4075220 — CJoe, Jun 11 '18 at 08:22
@ΑλέξανδροςΖεγγ - it's Czech (but the problem is not tied to a specific language) — CJoe, Sep 02 '18 at 08:47

