How do I customize a LuaLaTeX cmap?

Question

I'm working on a system that uses database content to produce PDFs in a variety of languages. I'm currently working on Armenian.

What we've developed so far uses pdfLaTeX. I made a special encoding to handle Armenian and got it all working, only to discover that the five ligatures in FreeSerif that had underscores in the glyph names ended up not being copyable or searchable. For example, copying the ligature glyph m_n_armenian ends up with "mn," and \pdfglyphtounicode doesn't seem able to override the default behavior of treating the ASCII letters separated by underscores as being the desired target characters for copying and pasting.

I explored using XeLaTeX, but found that colored footnotes spanning multiple pages lose their color on the second page. pdfcolfoot fixes this problem in pdfLaTeX, but the required color stacks aren't available in XeLaTeX.

Having just discovered that LuaLaTeX doesn't have this limitation, I'm checking whether it will do what I need. Given that LuaLaTeX seems to be the successor to pdfLaTeX, we should probably move in that direction if LuaLaTeX can do everything we need done.

\documentclass[14pt]{memoir}
\usepackage{fontspec}
\setmainfont[Script=Armenian]{Freeserif}
\pdfcompresslevel=0
\begin{document}
ﬓ ﬔ ﬕ ﬖ ﬗ 

մն մե մի վն մխ
\end{document}

The first line of text above contains the five Armenian ligatures in question, Unicode characters FB13 through FB17. The second line contains the ten characters that make up these five ligatures, all from the range 0500 to 05FF. Compiling this with LuaLaTeX, I end up with two lines of the five ligatures displayed correctly. But when I copy and paste, both lines come out as the five ligature characters; the second line does not paste as the ten separate characters.

I want to override this default behavior. I want the cmap in the PDF to specify:

<1BF9> <05740576>
<1BFA> <05740565>
<1BFB> <0574056B>
<1BFC> <057E0576>
<1BFD> <0574056D>

instead of:

<1BF9> <1BFD> <FB13>

But I can't figure out how to alter the cmaps at all via LuaLaTeX. I've found http://www.luatex.org/svn/branches/0.70.x/source/texk/web2c/luatexdir/font/tounicode.w but I'm not sure that the functions there are accessible via macros.

I think I need to keep the content of the PDFs identical to the content of the database. Otherwise, someone may take content from the database and search for it within a PDF, and never be able to find it since the ten characters when paired up are replaced by the ligature characters.

What should I do? Can the cmaps be tinkered with via LuaLaTeX? I have a much more difficult language to work with next, and I think I'm going to need this level of control over the cmaps.

While waiting for an answer here, I've started working on Devanagari, only to discover that LuaLaTeX doesn't presently support the correct placement of vowels in that script. If I can't use LuaLaTeX for all the languages, I guess I'm open to a non-LuaLaTeX solution to my Armenian problem, such as a way to correct the cmap when using glyphtounicode.tex. By the way, the DB content is all in UTF-8. — Pickle, Oct 01 '12 at 02:16
... such as a way to correct the cmap when using glyphtounicode.tex in pdfLaTeX. — Pickle, Oct 01 '12 at 02:24

Pickle · Accepted Answer · 2012-10-18T16:56:59.487

In the preamble, include: \usepackage{luacode}. Just before \end{document}, include the following:

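\begin{luacode*}
    -- Map each ligature's code point (U+FB13 to U+FB17, in decimal) to the
    -- UTF-16BE hex string of the letters it should yield when copied.
    local tounicodevalues = {
        [64275] = "05740576",
        [64276] = "05740565",
        [64277] = "0574056B",
        [64278] = "057E0576",
        [64279] = "0574056D",
    }
    for i, f in font.each() do
        -- Only touch the Armenian FreeSerif instances.
        if string.match(f.name, "FreeSerif") and string.match(f.name, "script=armn") then
            for u, v in next, tounicodevalues do
                -- Skip glyphs that this font instance does not contain.
                if f.characters[u] then
                    f.characters[u].tounicode = v
                end
            end
            font.fonts[i] = f -- write the modified font table back
        end
    end
\end{luacode*}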

Why before \end{document}? Because by that point LuaLaTeX has loaded every font that the document actually uses.

Why restrict the routine to certain matching fonts? LuaTeX forbids changing a font that has already been accessed, such as line10, and raises an error if you try. Matching on the name keeps the loop away from those fonts.

This code could be wrapped in a TeX macro and placed in the preamble, and then called just before \end{document}. This example doesn't use the luacode package:

\newcommand{\tounicode}[2][]{\directlua0{
    tounicodevalues = {
        [64275] = "05740576",
        [64276] = "05740565",
        [64277] = "0574056B",
        [64278] = "057E0576",
        [64279] = "0574056D",
    }
    for i,f in font.each() do
        if (string.match(f.name, "#1") and string.match(f.name, "#2")) then
            for u, v in next, tounicodevalues do
                if f.characters[u] then f.characters[u].tounicode = v end
            end
            font.fonts[i] = f
        end
    end
}}

\begin{document}

ﬓ ﬔ ﬕ ﬖ ﬗ

\tounicode[script=armn]{FreeSerif}
\end{document}

Using fontspec, I added the font feature HyphenChar={1418}, which changes the hyphenation character to the Armenian hyphen. Doing so triggers the altering-an-already-accessed-font error when using the above function, unless an additional matching criterion is supplied through the optional argument, such as "script=armn".
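For reference, the font setup described here might look like the line below. It is only a sketch that combines the Script option from the question with the HyphenChar value mentioned above; 1418 is the decimal value of U+058A, the Armenian hyphen.

```latex
% Preamble sketch: Armenian script plus the Armenian hyphen (U+058A = 1418).
\setmainfont[Script=Armenian, HyphenChar={1418}]{FreeSerif}
```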

To find the font names you need to match, add the following loop to the function above. It writes the id and name of every loaded font to the terminal and the log.

    for i,f in font.each() do
        texio.write_nl("["..i.."] => "..f.name)
    end
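To check that the override took effect, a similar loop can log the character-level tounicode value of one ligature for every loaded font. This is a minimal sketch, assuming the luacode package is loaded and that the block is placed after the patching code; which fonts report a value depends on the font loader.

```latex
\begin{luacode*}
-- Log the tounicode string stored for U+FB13 (decimal 64275) in each font
-- that contains that character; fonts without it are skipped.
for i, f in font.each() do
    local ch = f.characters and f.characters[64275]
    if ch then
        texio.write_nl("[" .. i .. "] " .. f.name .. ": " .. tostring(ch.tounicode))
    end
end
\end{luacode*}
```

A patched Armenian FreeSerif instance should then report 05740576 for this character.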

score 3 · Answer 2 · answered Oct 04 '12 at 09:37


LuaTeX's mechanism for defining fonts offers the following route to constructing the /ToUnicode entry. The following text is taken from section 7 Font structure (page 153 of the LuaTeX manual for beta 0.71.0):

The usage of tounicode is this: if this font specifies a tounicode=1 at the top level, then LuaTeX will construct a /ToUnicode entry for the pdf font (or font subset) based on the character-level tounicode strings, where they are available. If a character does not have a sensible Unicode equivalent, do not provide a string either (no empty strings). If the font-level tounicode is not set, then LuaTeX will build up /ToUnicode based on the TeX code points you used, and any character-level tounicodes will be ignored. At the moment, the string format is exactly the format that is expected by Adobe CMap files (utf-16BE in hexadecimal encoding), minus the enclosing angle brackets. This may change in the future. Small example: the tounicode for a fi ligature would be 00660069.
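As an illustration of that string format, the Lua helper below builds the hex string from a list of Unicode code points. It is a sketch only: the name utf16be_hex is hypothetical and not part of LuaTeX, and the block assumes the luacode package is loaded. The results go to the terminal and log through texio.write_nl.

```latex
\begin{luacode*}
-- Hypothetical helper (not a LuaTeX API): turn a list of Unicode code
-- points into the UTF-16BE hex string used for character-level tounicode.
local function utf16be_hex(codepoints)
    local parts = {}
    for _, cp in ipairs(codepoints) do
        if cp < 0x10000 then
            parts[#parts + 1] = string.format("%04X", cp)
        else
            -- Code points beyond the BMP need a surrogate pair.
            local v = cp - 0x10000
            local high = 0xD800 + math.floor(v / 0x400)
            local low = 0xDC00 + v % 0x400
            parts[#parts + 1] = string.format("%04X%04X", high, low)
        end
    end
    return table.concat(parts)
end

-- Armenian men (U+0574) followed by now (U+0576), the target for U+FB13.
texio.write_nl(utf16be_hex({ 0x0574, 0x0576 }))  -- logs 05740576
-- Latin f and i, matching the manual's fi example.
texio.write_nl(utf16be_hex({ 0x0066, 0x0069 }))  -- logs 00660069
\end{luacode*}
```

The Armenian ligatures only need the first branch, since all of their component letters lie in the range 0500 to 05FF.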

answered Oct 04 '12 at 09:37

Graham Douglas


For pdfLaTeX I modified the FreeSerif font so that the names of the Armenian ligature glyphs no longer contained underscores, and that seemed to do the trick. However, copying and pasting results in a loss of spaces before and after one ligature in normal weight, and all ligatures in bold. I can't find a workaround, and LuaLaTeX doesn't have this problem, from what I can tell. Thus I'm back to trying to disable the cmap in LuaLaTeX. So how do I set the tounicode value to 0? Or what do I do to the font to force Lua to set that value to 0? – Pickle Oct 05 '12 at 20:04
Why would a document compiled under pdfLaTeX lose spaces between ﬗ's leading and trailing characters when copying or searching, but the same document compiled under LuaLaTeX not have this problem? – Pickle Oct 07 '12 at 03:44
LuaLaTeX loses spaces too if I am using an 8-bit font via luainputenc, but not if I use the very same font as a unicode font. The only difference between the two fonts is that I used autoinst (otftotfm) and a custom encoding to create PFB and TFM files from the OTF file. Why are the spaces between words not lost if I use the straight OTF, but are lost if I use the PFB and TFM via luainputenc? – Pickle Oct 08 '12 at 21:17
I now understand the citation you gave. If I change the top-level tounicode value to 0, then the TeX code points will be used, which is what is already happening. Thus I need to alter the character-level tounicode values and keep the top-level tounicode value at 1. How to actually change the character-level tounicode values I will place in a separate answer. – Pickle Oct 11 '12 at 18:27
