I'm working on a project that converts images of scientific formulas into LaTeX strings.

During development, I found that the same formula can be written as several different LaTeX strings.

For example:

  1. \left(A\right)\frac{125}{300};
  2. \text { (A) } \frac{125}{300} \text {; }
  3. (A)\frac{125}{300};

All three LaTeX strings above describe the same mathematical formula image.

Is there any way to convert all of these different LaTeX styles into a single format? If so, I could evaluate LaTeX OCR with CER/WER metrics, accuracy, and so on, or compare different API services more precisely and conveniently.
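To make the idea concrete, here is a minimal sketch of the kind of normalization I have in mind, restricted to the three variants listed above. The function names `normalize_latex`, `edit_distance`, and `cer` are hypothetical helpers written for this illustration, not part of any existing library. The rules are deliberately lossy: they throw away the distinction between text mode and math mode, so they are only meant for rough string comparison, not for judging typesetting.

```python
import re


def normalize_latex(s: str) -> str:
    """Lossy normalization for a small subset of LaTeX (illustrative only)."""
    # Drop \left and \right but keep the delimiter that follows them.
    # The lookahead avoids touching commands such as \leftarrow.
    s = re.sub(r"\\(left|right)(?![A-Za-z])", "", s)
    # Unwrap \text{...} when the argument has no nested braces.
    s = re.sub(r"\\text\s*\{([^{}]*)\}", r"\1", s)
    # Remove all whitespace. Assumption: no control word is directly
    # followed by a letter (e.g. "\alpha x" would wrongly become "\alphax").
    s = re.sub(r"\s+", "", s)
    return s


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


variants = [
    r"\left(A\right)\frac{125}{300};",
    r"\text { (A) } \frac{125}{300} \text {; }",
    r"(A)\frac{125}{300};",
]
normalized = [normalize_latex(v) for v in variants]
# All three variants collapse to the string (A)\frac{125}{300};
print(normalized)
print(cer(normalized[2], normalized[0]))
```

With these rules the CER between any two of the normalized variants is zero, whereas the CER on the raw strings is not. A real solution would need many more rules and an explicit definition of which strings count as equivalent.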

Updated: In my case, what I currently want is to automatically make a relative comparison of the rendered outputs of different OCR API services. However, that is impossible if I rely only on the LaTeX strings (because of the differences listed above). Of course, when I develop models and evaluate in-house solutions, all details matter.

nguyendhn
  • They don't describe the same formula: a text A has a different meaning from a math A. They don't even look the same, e.g. the spacing will be different and the A will be upright or italic. – Ulrike Fischer Nov 14 '21 at 09:28
  • If you mean you are generating LaTeX by OCR of images, then you should avoid generating the first two, as the third is the correct form in almost all cases. – David Carlisle Nov 14 '21 at 10:01
  • @UlrikeFischer I understand the problem you have mentioned above. But in my case, what I currently want is to automatically make a relative comparison of the rendered outputs of different OCR API services. However, that is impossible if I rely only on the LaTeX strings (because of the differences I have listed above). Of course, when I develop models and evaluate in-house solutions, all details matter. – nguyendhn Nov 17 '21 at 02:58
  • (by the way there are similar projects for OCRing LaTeX – inftyreader / mathpix for comparison, see also https://tex.stackexchange.com/questions/8503/how-to-convert-pdf-to-latex ) – user202729 Nov 17 '21 at 04:25
  • For this one it seems that the proper way is to use the enumitem package to generate (A), (B) etc. automatically. In the general case there obviously isn't any way (because LaTeX is a programming language which allows very complex structures). If you restrict to a subset of LaTeX then it might be possible (but you have to define exactly what "equivalent" means), but I don't think any such program exists yet (although there are some for parsing (a subset of) LaTeX from other programming languages such as Python). – user202729 Nov 17 '21 at 04:27

0 Answers