Alphabet recognition

Question

I have a picture with some writing in it. The text is computer-typeset in an alphabet unknown to me. Is there a way to use Mathematica to tell me which alphabet or language was used? Here are two versions of the same text, both written in alphabets I don't recognize.

To solve your immediate problem: it's written in Lao : https://unicode-table.com/en/blocks/lao --- no in fact, it's closer to Thai https://unicode-table.com/en/blocks/thai/ — flinty, Jun 13 '20 at 14:14
The first one: img = Import["https://i.stack.imgur.com/j9NXm.jpg"]; TextRecognize[img, Language -> "Thai"] gives ยาวเกานิว but this is slightly incorrect. According to google translate it should be "ยาวเก้า นิ้ว" which translates to Nine Inches Long — flinty, Jun 13 '20 at 14:23
Thank you very much. How did you identify it? I know Mathematica can recognize text, but you have to tell the language. I need the opposite - to let the Mathematica tell me the language. — azerbajdzan, Jun 13 '20 at 14:25
I just knew it was Thai :) https://www.youtube.com/watch?v=tvOgIqYroiA . As for the recognizing the scripts automatically, I'll look into that. — flinty, Jun 13 '20 at 14:30
So I used google translator to identify single words: long=ยาว nine=เก้า inch=นิ้ว That is all from linguistics point of view. Now I am more interested in Mathematica code, that would identify it for me. — azerbajdzan, Jun 13 '20 at 14:54
Mathematica won't do it without knowing the language in advance. I also didn't find anything in the Wolfram Neural Net Repository. You would have to train your own. However, I've been doing some research and there's not much on whole-unicode OCR script recognition. You may want to look into this https://github.com/tesseract-ocr/tesseract , a prebuilt windows installer exists too https://github.com/UB-Mannheim/tesseract/wiki. Make sure to select additional languages/scripts when you install. — flinty, Jun 13 '20 at 15:08
It’s worth noting that Tesseract is what is being used by TextRecognize in flinty’s answer, so unless there is some special argument or something that cannot be given to Tesseract through the Mathematica interface, TextRecognize should be sufficient. — C. E., Jun 13 '20 at 16:14
@azerbajdzan Have a look at flinty’s comment just before mine. — C. E., Jun 13 '20 at 17:40

score 8 · Accepted Answer · answered Jun 13 '20 at 15:40

Here's the list of all languages supported by TextRecognize in v12.1.

languages = {"Afrikaans", "Albanian", "Azerbaijani", "Belarusian", "Bosnian", 
  "Bulgarian", "Catalan", "Cebuano", "ChineseSimplified", 
  "ChineseTraditional", "Croatian", "Czech", "Danish", "Dutch", 
  "English", "Esperanto", "Estonian", "Finnish", "French", "Galician",
   "Georgian", "German", "Greek", "Haitian", "Hungarian", "Icelandic",
   "Indonesian", "Irish", "Italian", "Japanese", "Kazakh", "Kirghiz", 
  "Korean", "Lao", "Latin", "Lithuanian", "Macedonian", "Malay", 
  "Norwegian", "Polish", "Portuguese", "Romanian", "Russian", 
  "Serbian", "Slovak", "Slovenian", "Spanish", "Swahili", "Swedish", 
  "Tajik", "Thai", "Turkish", "Ukrainian", "Uzbek", "Vietnamese", "Welsh"};

The first run will take a long time because it has to download the data for every language, so I recommend removing any languages from the list that you know aren't relevant. The code below runs the recognizer once per language and produces a list of entries of the form {language, {text, strength}}, where a higher strength indicates a better match:

img = Import["https://i.stack.imgur.com/j9NXm.jpg"];
results = {#, TextRecognize[img, "Line", {"Text", "Strength"}, Language -> #]} & /@ languages;

I slimmed down the list of languages to demonstrate:

results = {#, TextRecognize[img, "Line", {"Text", "Strength"},
 Language -> #]} & /@ {"English", "French", "Japanese", "Lao", "Thai"}

(**
   English  {gyaaniia,0.}
   French   {NN,0.}
   Japanese {ココ!せっ,0.15696}
   Lao      {ປາງເຄານິວ,0.610667}
   Thai     {ยาวเกานิว,0.941698}
**)

You can select the entry with the highest strength using First[MaximalBy[results, #[[2, 2]] &]], which gives:

{"Thai", {"ยาวเกานิว", 0.941698}}
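
If you want to reuse this, the two steps (recognize once per language, then keep the strongest match) could be wrapped up roughly as follows. This is only a sketch: the names bestCandidate and detectLanguage are hypothetical helpers, not built-in functions, and it assumes the image contains a single line of text, so that each TextRecognize call returns one {text, strength} pair as in the results above. Entries that don't have that shape are simply skipped.

```mathematica
(* Hypothetical helpers: bestCandidate and detectLanguage are illustrative
   names, not part of Mathematica. *)

(* Pick the entry with the highest strength from a list of
   {language, {text, strength}} entries. Entries that do not have this
   shape (for example, failed recognitions) are ignored. *)
bestCandidate[results_List] := Module[{valid},
  valid = Cases[results, {_String, {_String, _?NumericQ}}];
  If[valid === {},
   Missing["NotRecognized"],
   First[MaximalBy[valid, #[[2, 2]] &]]]]

(* Run TextRecognize once per candidate language and return the best entry.
   Assumption: img holds a single line of text, so each call yields one
   {text, strength} pair. *)
detectLanguage[img_, langs_List] := bestCandidate[
  {#, TextRecognize[img, "Line", {"Text", "Strength"}, Language -> #]} & /@ langs]
```

Calling detectLanguage[img, {"English", "French", "Japanese", "Lao", "Thai"}] on the image from the question should then reproduce the selection shown above. Keep in mind that this only ranks the languages you pass in, so it can't identify a script that isn't among the candidates.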
