Either batch transcode (character sets) a whole list of image filenames ... or use a character encoding conversion function within the main TeX file?

Question

I am very excited about this code by user @egreg, which I will be using to display images of the stroke order of a Chinese character whenever you hover over that character.

I changed the code a little for my use case and ended up with the following (note that I use the font HanWangKaiMediumChuIn_wp010-08.ttf, so you will need to download it if you want to use the same font):

\documentclass[a6paper,12pt]{scrbook}

\usepackage{fontspec}
\setmainfont{HanWangKaiMediumChuIn_wp010-08}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% tooltips with LaTeX
%
% optimized for Adobe Reader (visible on mouse-over)
%     usage: \tooltip[<link colour>]{<link text>}[<tip box colour>]{<tip text>}
%   non-draggable version:
%     usage: \tooltip*[<link colour>]{<link text>}[<tip box colour>]{<tip text>}
%
% for Evince (visible on click, not draggable)
%   usage: \tooltip**[<link colour>]{<link text>}[<tip box colour>]{<tip text>}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{pdfbase}[2017/03/16]
\usepackage{xparse,ocgbase}
\usepackage{xcolor,calc}
\usepackage{tikz}
\usetikzlibrary{calc}

\ExplSyntaxOn
\let\tpPdfLink\pbs_pdflink:nn
\let\tpPdfAnnot\pbs_pdfannot:nnnn\let\tpPdfLastAnn\pbs_pdflastann:
\let\tpAppendToFields\pbs_appendtofields:n
\def\tpPdfXform{\pbs_pdfxform:nnnnn{1}{1}{}{}}
\let\tpPdfLastXform\pbs_pdflastxform:
\ExplSyntaxOff

\makeatletter
\NewDocumentCommand{\tooltip}{ssO{blue}mO{yellow!20}m}{{%
  \leavevmode%
  \IfBooleanT{#1}{%
    \ocgbase@new@ocg{tipOCG.\thetcnt}{%
      /Print<</PrintState/OFF>>/Export<</ExportState/OFF>>%
    }{false}%
    \xdef\tpTipOcg{\ocgbase@last@ocg}%
  }%
  \tpPdfLink{%
    \IfBooleanTF{#2}{%
      /Subtype/Link/Border [0 0 0]/A <</S/SetOCGState/State [/Toggle \tpTipOcg]>>
    }{%
      /Subtype/Screen%
      \IfBooleanTF{#1}{%
        /AA<<%
          /E<</S/SetOCGState/State [/ON \tpTipOcg]>>%
          /X<</S/SetOCGState/State [/OFF \tpTipOcg]>>%
        >>%
      }{
        /AA<<%
          /E<</S/JavaScript/JS(%
            var fd=this.getField('tip.\thetcnt');%
            \IfBooleanF{#1}{%
              if(typeof(click\thetcnt)=='undefined'){%
                var click\thetcnt=false;%
                var fdor\thetcnt=fd.rect;var dragging\thetcnt=false;%
              }%
            }%
            if(fd.display==display.hidden){%
              fd.delay=true;fd.display=display.visible;fd.delay=false;%
            }%
           this.dirty=false;%
          )>>%
          /X<</S/JavaScript/JS(%
            if(!click\thetcnt&&!dragging\thetcnt){fd.display=display.hidden;}%
            if(!dragging\thetcnt){click\thetcnt=false;}%
            this.dirty=false;%
          )>>%
          /U<</S/JavaScript/JS(click\thetcnt=true;this.dirty=false;)>>%
          /PC<</S/JavaScript/JS (%
            var fd=this.getField('tip.\thetcnt');%
            try{fd.rect=fdor\thetcnt;}catch(e){}%
            fd.display=display.hidden;this.dirty=false;%
          )>>%
          /PO<</S/JavaScript/JS(this.dirty=false;)>>%
        >>%
      }
    }%
  }{{\color{#3}#4}}%
  \sbox\tiptext{\fcolorbox{black}{#5}{#6}}%
  \edef\twd{\the\wd\tiptext}%
  \edef\tht{\the\ht\tiptext}%
  \edef\tdp{\the\dp\tiptext}%
  \tpPdfXform{\tiptext}%
  %tip box placed at top left page corner
  \begin{tikzpicture}[remember picture,overlay]
    \node [inner sep=0pt, anchor=base] at (current page.north west) {%
      \raisebox{-\tht}[0pt][0pt]{%
        \tpPdfAnnot{\twd}{\tht}{\tdp}{%
          /Subtype/Widget/FT/Btn/T (tip.\thetcnt)%
          /AP<</N \tpPdfLastXform>>%
          /MK<</TP 1/I \tpPdfLastXform/IF<</S/A/FB true/A [0.0 0.0]>>>>%
          \IfBooleanTF{#1}{%
            /Ff 65537/OC \tpTipOcg%
          }{%
            /Ff 65536/F 3%
            /AA <<%
              /U <<%
                /S/JavaScript/JS(%
                  var fd=event.target;%
                  var mX=this.mouseX;var mY=this.mouseY;%
                  var drag=function(){%
                    var nX=this.mouseX;var nY=this.mouseY;%
                    var dX=nX-mX;var dY=nY-mY;%
                    var fdr=fd.rect;%
                    fdr[0]+=dX;fdr[1]+=dY;fdr[2]+=dX;fdr[3]+=dY;%
                    fd.rect=fdr;mX=nX;mY=nY;%
                  };%
                  if(!dragging\thetcnt){%
                    dragging\thetcnt=true;Int=app.setInterval("drag()",1);%
                  }%
                  else{app.clearInterval(Int);dragging\thetcnt=false;}%
                  this.dirty=false;%
                )%
              >>%
            >>%
          }%
        }%
        \tpAppendToFields{\tpPdfLastAnn}%
      }%
    };
  \end{tikzpicture}
  \stepcounter{tcnt}%
}}
\makeatother
\newsavebox\tiptext\newcounter{tcnt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\ExplSyntaxOn
\NewDocumentCommand{\tooltips}{sO{}m+O{}}
 {
  \tl_map_inline:nn { #3 }
   {
    \IfBooleanTF{#1}{\tooltip*}{\tooltip*}{##1}{\includegraphics[#2]{images/##1}}#4
   }
 }
\ExplSyntaxOff

\begin{document}

\tooltips[scale=1]{高一生}

\end{document}

Another note: be sure to compile twice.

The code, however, is designed to look for an image with the same name as the character. For example, if one hovers over the character 高, the code currently looks for the image 高.png (in the folder images).

Now, this is meant to work with a large set of Chinese characters. In fact, I have the required images for thousands of such characters. Unfortunately, the images are not named e.g. 高.png, nor even U+9AD8.png (the UCS code point of that character); instead the file is named B0AA.png, which is the Big5 encoding of that character.

So, that leaves me with two options to make this code workable.

Either I batch-convert the names of all the .png files into their respective Chinese characters (for example B0AA.png into 高.png). But I wouldn't know how to do this. Is there a way to do this with TeX?
Alternatively, how could I adapt the code above so that it "converts" each character in \tooltips[scale=1]{高一生} into its Big5 encoding, but only when looking up the image for that character (so that \tooltips[scale=1]{高一生} still displays 高一生 but looks for the images B0AA.png, A440.png and A5CD.png respectively)?

Can anybody offer a solution, please?

Please note that I do have a .txt file available that could be used as a conversion table, as it lists all the necessary characters like this:

UCS Big5
UCS Big5
UCS Big5
... ...

For the purpose of an M(not-yet-)WE, one could use the following minimal .txt file, which contains the UCS (left) and Big5 (right) codes for 高一生:

U+9AD8  B0AA
U+4E00  A440
U+751F  A5CD

By the way, if you want to use the images I am using, you can find them at

http://stroke-order.learningweb.moe.edu.tw/words/B0AA.png

at http://stroke-order.learningweb.moe.edu.tw/words/A440.png

and at http://stroke-order.learningweb.moe.edu.tw/words/A5CD.png

That is, for the example case of 高一生. Be sure to put them in a folder called "images" next to your main .tex file.

Finally, this code is supposed to produce .pdf files to be viewed with Adobe Acrobat Reader. I am not bound to a specific compiler, but the only way I have managed to compile this successfully is with LuaLaTeX. As per this answer by @David Carlisle, XeTeX might be another option, but I can't figure out what small changes I would have to make.

If your file compiles with lualatex (I haven't tried), then it may be easier to write this encoding conversion as a piece of Lua code. Of course in principle it can be done in TeX macros too, and as you already have a file for use as a conversion table, using that data may make the problem easier. – ShreevatsaR Sep 20 '17 at 02:08
@ShreevatsaR Yes, exactly; compiling is done with LuaLaTeX. Let me just mention that in the OP. Would you have a possible solution? – O0123 Sep 20 '17 at 02:10
Does your .txt file contain the UCS <-> Big5 mapping for every character that you care about? Can you upload your .txt file somewhere and add the link here? – ShreevatsaR Sep 20 '17 at 02:45
@ShreevatsaR Yes. Don't worry about the .txt file. To avoid even more download links that might expire, I will add MWE content (for the characters 高一生) to the OP, to be saved as a .txt file (first line: U+9AD8 B0AA, second line: U+4E00 A440, third line: U+751F A5CD). – O0123 Sep 20 '17 at 02:55
@cfr The nice thing about renaming the PNGs to Chinese characters is that they are then instantly human-readable, even beyond this particular use case (that is, of course, for people with knowledge of the language). But if you think it's easy to rename the thousands of images to their UCS encoding, I am all ears to learn how. – O0123 Sep 20 '17 at 02:59
@cfr Just for simplicity, perhaps you could delete the comment starting with "It won't work. You can't ping people from posts", as it is not directly related to the main question, please? – O0123 Sep 20 '17 at 05:37

ShreevatsaR · Accepted Answer · 2017-10-01T07:07:33.540

Of the two options, the better solution may be to rename all your files (e.g. rename B0AA.png to 高.png), assuming all your tools that work on these files are capable of handling such filenames, and assuming you are comfortable reading and distinguishing these filenames. But TeX would not be a good tool for such filesystem operations.
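
For the renaming route, a short script outside TeX is one way to go. The following is a minimal sketch in Python, assuming the map.txt format shown in the question (one "U+XXXX  BIG5" pair per line) and a folder of .png files named after their Big5 codes. The names load_map and rename_images are made up for this illustration, and the dry_run flag is there so you can inspect the planned renames before touching any file.

```python
from pathlib import Path


def load_map(map_path):
    """Read 'U+XXXX  BIG5' lines into a dict {Big5 hex name: character}."""
    mapping = {}
    for line in Path(map_path).read_text(encoding="utf-8").splitlines():
        fields = line.split()
        if len(fields) != 2 or not fields[0].upper().startswith("U+"):
            continue  # skip blank or malformed lines
        ucs, big5 = fields
        mapping[big5.upper()] = chr(int(ucs[2:], 16))
    return mapping


def rename_images(image_dir, mapping, dry_run=True):
    """Rename e.g. B0AA.png to 高.png; returns the (old, new) name pairs."""
    renamed = []
    for path in sorted(Path(image_dir).glob("*.png")):
        char = mapping.get(path.stem.upper())
        if char is None:
            continue  # no table entry: leave the file alone
        target = path.with_name(char + path.suffix)
        if target.exists():
            continue  # never overwrite an existing file
        if not dry_run:
            path.rename(target)
        renamed.append((path.name, target.name))
    return renamed


# Example (inspect the dry run first, then rename for real):
#   pairs = rename_images("images", load_map("map.txt"), dry_run=True)
#   rename_images("images", load_map("map.txt"), dry_run=False)
```

Files without an entry in the table are skipped rather than guessed at, so a partial map.txt only renames what it covers.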

The other option is doable in TeX: you can convert each character to its Big5 encoding when you specify the filename in TeX, thus solving the problem. This sort of programming is doable even in TeX macros, but as you're using LuaTeX it becomes significantly easier.

Lua has no native understanding of Unicode; a UTF-8 encoded byte string is just a byte string to Lua. But this is not a problem; below I've written a simple UTF-8 decoder in Lua.
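
To make the decoding arithmetic concrete, here is the same idea sketched in Python, purely as an illustration outside the LuaTeX setup. The function name ucs_from_utf8 is hypothetical, and like the simple decoder it assumes valid input apart from one check on continuation bytes.

```python
def ucs_from_utf8(data: bytes) -> int:
    """Decode one valid UTF-8 encoded character to its code point."""
    n = data[0]
    if n < 0x80:        # 0xxxxxxx: ASCII, a single byte
        return n
    if n < 0xE0:        # 110xxxxx: 5 payload bits, one continuation byte
        value, count = n - 0xC0, 1
    elif n < 0xF0:      # 1110xxxx: 4 payload bits, two continuation bytes
        value, count = n - 0xE0, 2
    else:               # 11110xxx: 3 payload bits, three continuation bytes
        value, count = n - 0xF0, 3
    for b in data[1:1 + count]:
        # A continuation byte is 10xxxxxx, i.e. 0x80 <= b <= 0xBF (0x80 included).
        assert 0x80 <= b <= 0xBF, "continuation byte must start with bits 10"
        value = value * 64 + (b - 0x80)   # shift left by 6 bits, add 6 payload bits
    return value


for ch in "高一生":
    raw = ch.encode("utf-8")
    print(raw.hex().upper(), "->", "U+%04X" % ucs_from_utf8(raw))
# prints:
#   E9AB98 -> U+9AD8
#   E4B880 -> U+4E00
#   E7949F -> U+751F
```

Note that 一 encodes as E4 B8 80, so its last byte is exactly 0x80, the smallest valid continuation byte; a check written as 128 < b instead of 128 <= b would wrongly reject it.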

The following Lua and TeX files define a macro \bigfive so that you can type \bigfive{高} and it will expand (be equivalent) to B0AA, the Big5 encoding of that character. The Lua code can be put in the .tex file, but I prefer to put it into a separate file called big5.lua (say):

function ucsFromChar(s)
   -- UTF-8 decoder. Given a valid UTF-8-encoded character s, returns the number (codepoint) it encodes.
   -- A stricter version is at https://gist.github.com/shreevatsa/6aef61aafd4ccfe2149ddacdc5c5855d
   local n = string.byte(s, 1)
   if n < 128 then return n end -- Starts with a 0
   if n < 224 then              -- Starts with 110, so get remaining 5 bits, then 6 bits from next byte
      return (n - 192) * 64 + string.byte(s, 2) - 128
   elseif n < 240 then          -- Starts with 1110, so get remaining 4 bits, then 6 bits each from next 2 bytes
      return ((n - 224) * 64 + string.byte(s, 2) - 128) * 64 + string.byte(s, 3) - 128
   else                         -- Starts with 11110, so get remaining 3 bits, then 6 bits each from next 3 bytes
      return (((n - 240) * 64 + string.byte(s, 2) - 128) * 64 + string.byte(s, 3) - 128) * 64 + string.byte(s, 4) - 128
   end
end

map = nil  -- Populated and used by getMap() below.
function getMap()
   -- Returns map from a file containing a space-separated key-value pair per line.
   if map ~= nil then return map end
   map = {}
   for line in io.open('map.txt', 'r'):lines() do
      local key = nil
      for s in string.gmatch(line, "%S+") do
         if key ~= nil then map[key] = s else key = s end
      end
   end
   return map
end

function big5FromChar(s)
   local u = string.format('U+%04X', ucsFromChar(s))  -- e.g. U+9AD8; zero-padded so that keys like U+0028 match too
   -- return getMap()[u]
   local v = getMap()[u]
   if v ~= nil then return v else return s end
end

This assumes that (as stated in the question) you have a map.txt file with lines like:

U+9AD8  B0AA
U+4E00  A440
U+751F  A5CD

With this, the .tex file can be just:

\documentclass{article}
\usepackage{fontspec}
\setmainfont{AppleGothic}

\directlua{dofile('big5.lua')}
\newcommand{\bigfive}[1]{\directlua{tex.sprint(big5FromChar('#1'))}}

\begin{document}
Anywhere in my document I can use the \verb+\bigfive+ macro.

For example, 高 is \bigfive{高} in big5.
\end{document}

The above produces a page whose second paragraph reads "For example, 高 is B0AA in big5."

edited Oct 01 '17 at 07:07

answered Sep 20 '17 at 06:23

ShreevatsaR


Very impressive. Thank you so much. If I change the line \IfBooleanTF{#1}{\tooltip*}{\tooltip*}{##1}{\includegraphics[#2]{images/##1}}#4 into \IfBooleanTF{#1}{\tooltip*}{\tooltip*}{##1}{\includegraphics[#2]{images/\bigfive{##1}}}#4, it does indeed compile when using e.g. \tooltips[scale=1]{高} or \tooltips[scale=1]{國字標準字體筆順} and many many more characters. Strangely, I found it doesn't compile when trying to use the character 一, even though its UCS and Big5 (namely U+4E00 A440) are in map.txt and I have an image named A440.png ... – O0123 Sep 20 '17 at 06:43
The same problem occurs when trying your code only (without the OPs code). For example \bigfive{高} compiles, so does \bigfive{生}, but somehow \bigfive{一} doesn't. – O0123 Sep 20 '17 at 06:55
Something as follows: (./OTHER.aux)(load luc: /Users/User/Library/texlife/2017/texmf-varluatex-cache/generic/fonts/otl/lmono10-regular.luc)big5.lua:4: Continuation byte must start with 10:128, stack traceback:, [C]: in function 'assert', big5.lua:4: in function 'getCBValue', big5.lua:16: in function 'ucsFromChar', big5.lua:45; in function 'big5fromChar', [\directlua]:1: in main chunk., \bigfive ...ctlua {tex.sprint(big5FromChar('#1'))}, l.13 For example, 一 is \bigfive{一}, in big5., ?. – O0123 Sep 20 '17 at 07:03
@VincentMiaEdieVerheyen Oops it was a bug in my UTF-8 decoder (128 < n should have been 128 <= n). Can you try the version that is in the answer now? – ShreevatsaR Sep 20 '17 at 07:07
Congratulations, you fixed it! So is it advisable to continue using your first (longer) revision but change 128 < n into 128 <= n, or continue with the later (shorter) revision? – O0123 Sep 20 '17 at 07:12
@VincentMiaEdieVerheyen Ideally both should be equivalent, after that bug is fixed. :-) You can use whichever works for you. – ShreevatsaR Sep 20 '17 at 07:13
So, it depends on whether or not you want "more error-checking and more debug output"? I see. – O0123 Sep 20 '17 at 07:17
I have asked a new question now about excluding punctuation marks. – O0123 Oct 01 '17 at 03:28
@VincentMiaEdieVerheyen I don't have time now to look into it deeply, but I've changed the lua big5FromChar function (see the last 4 lines of the big5.lua file in the answer) to, on seeing unknown characters like (, simply return them instead of returning a nil string because no translations were present for them in map.txt. See if this works for you. (Actually, even without this change, you may be able to get away with simply adding entries in map.txt for those characters; e.g. for ( you could add the line U+0028 ( in the map.txt file.) – ShreevatsaR Oct 01 '17 at 07:10
The problem is that your code is being used inside \tooltip code, so even if your big5FromChar returns the same character, e.g. ( for (, that character would still be run through the \tooltip code, which is not desired. For these characters, there doesn't need to be a tooltip at all. They just need to be printed. – O0123 Oct 01 '17 at 07:19
Well I don't know what your \tooltip does, and I don't see it defined at the linked question either, so I give up for now. :-) Good luck. – ShreevatsaR Oct 01 '17 at 07:32
I am sorry, I should have been more precise and said \tooltips instead, as per this answer. You can see my full code here. But, in fact, it might not matter how \tooltips is defined. What would be nice is if there were a way to let input jump out of a command (e.g. \tooltips or \bigfive) when it should not be run through that command. – O0123 Oct 01 '17 at 07:46
Is there any way for a character to return itself in case its UCS code is not listed in the file map.txt, please? See also the problems I have with your code at the new OP "What's so special about a and ㄅ". – O0123 Oct 05 '17 at 06:02
The last edit to my answer (on Oct 1) was precisely about that. Did you try it? Does it not work? (See the comment above) – ShreevatsaR Oct 05 '17 at 06:54
Wonderful, thanks. That solved the problem. – O0123 Oct 05 '17 at 07:34
