I'm working on a system that uses database content to produce PDFs in a variety of languages. I'm currently working on Armenian.
What we've developed so far uses pdfLaTeX. I made a special encoding to handle Armenian and got it all working, only to discover that the five ligatures in FreeSerif with underscores in their glyph names are not copyable or searchable. For example, copying the ligature glyph m_n_armenian yields "mn", and \pdfglyphtounicode doesn't seem able to override the default behavior, which treats the ASCII letters separated by underscores as the target characters for copying and pasting.
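For reference, the kind of declaration I mean looks roughly like this. It is a preamble fragment only: the glyph name m_n_armenian is the one from FreeSerif mentioned above, and 0574 0576 are the code points I would expect for that ligature. As described, it does not change what gets copied.

```latex
% pdfLaTeX preamble fragment (sketch of the attempted override)
\pdfgentounicode=1            % ask pdfTeX to write ToUnicode CMaps
\input{glyphtounicode}        % standard glyph-name-to-Unicode table
% Attempted override for the men+now ligature glyph:
\pdfglyphtounicode{m_n_armenian}{0574 0576}
```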
I explored using XeLaTeX, but found that colored footnotes spanning multiple pages lose their color on the second page. pdfcolfoot fixes this problem in pdfLaTeX, but the required color stacks aren't available in XeLaTeX.
Having just discovered that LuaLaTeX doesn't have this limitation, I'm checking whether it will do what I need. Given that LuaLaTeX seems to be the successor to pdfLaTeX, we should probably move in that direction if LuaLaTeX can do everything we need. Here is my test document:
\documentclass[14pt]{memoir}
\usepackage{fontspec}
\setmainfont[Script=Armenian]{Freeserif}
\pdfcompresslevel=0
\begin{document}
ﬓ ﬔ ﬕ ﬖ ﬗ
մն մե մի վն մխ
\end{document}
The first line of text above contains the five Armenian ligatures in question, Unicode characters U+FB13 through U+FB17. The second line contains the ten characters that make up these five ligatures, from the range U+0500 to U+05FF. When I compile this with LuaLaTeX, both lines are displayed correctly as the five ligatures. But when I copy and paste from the PDF, I get two lines of five ligature characters, not the ten component characters.
I want to override this default behavior. I want the cmap in the PDF to specify:
<1BF9> <05740576>
<1BFA> <05740565>
<1BFB> <0574056B>
<1BFC> <057E0576>
<1BFD> <0574056D>
instead of:
<1BF9> <1BFD> <FB13>
But I can't figure out how to alter the cmaps at all via LuaLaTeX. I've found http://www.luatex.org/svn/branches/0.70.x/source/texk/web2c/luatexdir/font/tounicode.w but I'm not sure that the functions there are accessible via macros.
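One route I have seen suggested, but have not verified against FreeSerif, is to set the per-character `tounicode` field from Lua while the font is being loaded. Below is a minimal, untested sketch. It assumes a luaotfload version that provides the `luaotfload.patch_font` callback, and that the ligature glyphs are indexed by their Unicode code points in the font table. The function name `patch_tounicode` and the callback description string are made up for illustration.

```latex
\documentclass[14pt]{memoir}
\usepackage{fontspec}
\usepackage{luacode}
\begin{luacode*}
-- Desired ToUnicode strings (UTF-16BE hex), keyed by ligature code point.
local overrides = {
  [0xFB13] = "05740576",
  [0xFB14] = "05740565",
  [0xFB15] = "0574056B",
  [0xFB16] = "057E0576",
  [0xFB17] = "0574056D",
}

-- Hypothetical helper: overwrite the tounicode field of each ligature
-- that is present in the font being loaded.
local function patch_tounicode(tfmdata)
  local characters = tfmdata.characters
  if not characters then return end
  for codepoint, hex in pairs(overrides) do
    local chr = characters[codepoint]
    if chr then
      chr.tounicode = hex
    end
  end
end

-- Assumption: registering before the font is defined applies the patch
-- to every font loaded afterwards.
luatexbase.add_to_callback("luaotfload.patch_font",
  patch_tounicode, "armenian_ligature_tounicode")
\end{luacode*}
\setmainfont[Script=Armenian]{Freeserif}
\begin{document}
ﬓ ﬔ ﬕ ﬖ ﬗ

մն մե մի վն մխ
\end{document}
```

If this works as hoped, the ToUnicode CMap should carry the two-character strings listed above instead of FB13 through FB17, but I have not confirmed that.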
I think I need to keep the content of the PDFs identical to the content of the database. Otherwise, someone may take content from the database and search for it within a PDF, and never be able to find it since the ten characters when paired up are replaced by the ligature characters.
What should I do? Can the cmaps be tinkered with via LuaLaTeX? I have a much more difficult language to work with next, and I think I'm going to need this level of control over the cmaps.