How are the glyph (character) names in PDF-files determined?

Question

PDF-files make internal use of glyph names. For example, the name of ≈ (U+2248; TeX \approx) appearing in a PDF-file might be approxequal.

One can find such names in a TeX-generated PDF-file by

compiling the TeX code with \pdfcompresslevel=0,
inspecting the resulting PDF-file as a text file, and
looking for lines starting with /CharSet.

(information taken from Ulrike Fischer's answer elsewhere, which provides more information).
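
The last step can be automated. The following Python sketch assumes the PDF was compiled with \pdfcompresslevel=0, so that the font descriptors are readable as plain bytes, and that each /CharSet entry is written as a parenthesized string of slash-separated names such as (/approxequal/minus). The function name charset_glyph_names is made up for illustration and is not part of any library.

```python
import re

# Assumption: /CharSet is stored as a PDF literal string "(/name1/name2/...)"
# in an uncompressed font descriptor, without escaped parentheses inside.
CHARSET = re.compile(rb"/CharSet\s*\(([^)]*)\)")


def charset_glyph_names(pdf_bytes: bytes) -> list[list[str]]:
    """Return one sorted list of glyph names per /CharSet entry found."""
    results = []
    for match in CHARSET.finditer(pdf_bytes):
        raw = match.group(1).decode("latin-1")
        # Names are separated by "/"; drop empty pieces and stray whitespace.
        names = {piece.strip() for piece in raw.split("/") if piece.strip()}
        results.append(sorted(names))
    return results
```

To try it, read the file in binary mode and pass the bytes in, for example charset_glyph_names(open("example.pdf", "rb").read()), where example.pdf is a placeholder for your own uncompressed PDF. Each inner list then corresponds to one embedded font subset.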

Apparently the glyph names are font-dependent. So they are determined by the fonts? Do all font formats use such names? Which font formats use textual names? Do all glyphs in all PDF-files have such names?

How are the glyph names in PDF-files determined? Who determined the existing ones? What are they for? Why doesn't PDF refer to the glyphs by number? Clearly some readers rely on the glyph names (see the link to the question about hyperlink detection below), so the PDF format or some readers make assumptions about these names. There must be a reason why an intermediary of names is used; perhaps it has to do with the age of Unicode relative to PDF. What else is there to know on this topic for a user of (La)TeX?

For me, the issue of PDF glyph names came up here:

Manipulating the Unicode codepoints of glyphs in the resulting PDF-file requires knowledge of the glyph names. Notably, glyphtounicode.tex maps from glyph names to Unicode codepoints, with lines such as \pdfglyphtounicode{approximatelyequal}{2245}: How to fix missing or incorrect mappings from glyphtounicode.tex
At least one PDF reader uses glyph names for a heuristic for HTTP URL detection: \input{glyphtounicode} with \pdfgentounicode=1 creates unwanted hyperlinks from link-like text
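
To see which codepoint a given glyph name is mapped to, the \pdfglyphtounicode lines can be read mechanically. This is a rough Python sketch, not an official tool. It assumes every entry has the form \pdfglyphtounicode{name}{hex codes}, that several codepoints in the second argument are separated by spaces, and that % starts a comment. The function name parse_glyphtounicode is hypothetical.

```python
import re

# Assumption: entries look like \pdfglyphtounicode{name}{HEX [HEX ...]}.
ENTRY = re.compile(r"\\pdfglyphtounicode\s*\{([^}]*)\}\s*\{([^}]*)\}")


def parse_glyphtounicode(tex_source: str) -> dict[str, str]:
    """Map each glyph name to the Unicode string it is declared to represent."""
    mapping = {}
    for line in tex_source.splitlines():
        code = line.split("%", 1)[0]  # ignore TeX comments
        for name, hexcodes in ENTRY.findall(code):
            # Convert each hexadecimal codepoint to its character.
            mapping[name] = "".join(chr(int(h, 16)) for h in hexcodes.split())
    return mapping
```

With the line quoted above as input, parse_glyphtounicode(...)["approximatelyequal"] gives the single character U+2245. Running the function over the text of glyphtounicode.tex and looking up a name taken from a /CharSet entry shows whether that name is covered at all.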

A similar question is How to find the proper glyph name required by \pdfglyphtounicode, but there is more ground that needs to be covered in this topic.

Well, the (Adobe) formats of PDF and Type 1 fonts (which use glyph names too) are quite old. Which numbers should a Type 1 font have used before Unicode? Besides this, names are useful: it is much easier for a human to handle meaningful names than numbers, e.g. when writing an encoding vector. And I only need to glance at the names in an AFM file to get an idea of the glyph coverage. And even Unicode today doesn't use only numbers but also gives all glyphs sensible names. So the idea of Adobe to use names to identify glyphs was imho quite natural. — Ulrike Fischer, Jul 14 '13 at 12:07

score 5 · Accepted Answer · answered Aug 25 '13 at 12:48

it's my understanding that the glyph names are determined by the font. (note use of the term "glyph"; characters and glyphs are related, but are not interchangeable. but that's another story.)

it's also my understanding that the names supplied by the font depend on the supplier of the font -- they may be "meaningful" in some way (e.g., an ascii letter, a unicode, a descriptive name, ...) or they may just be a supplier's internal code, as used to be the situation in the days of metal type (as shown in old monotype technical symbols listings).

things may change, but ... don't hold your breath.

adding to what ulrike has said, unicode also uses names as well as numbers. an important (but possibly irrelevant) point here is that, once both a name and a number are assigned, they are never changed, even should the name prove to be wrong, or just ill-advised.

a second point is that some glyphs are not necessarily named by a single unique unicode. a unicode is supposed to define meaning, not shape. "variant" glyphs (with the same meaning but different shape) may be represented by multiple unicodes, in two principal ways:

by using a combining diacritic, as \nvarleq is a compound of \leq (U+2264) and U+20D2, "combining long vertical line overlay"; almost no relations negated by a vertical cancellation are represented by single unicodes, and unless the basic principles of unicode assignment change, this will remain the norm.
by adding a defined "variation selector" (U+FE00) to designate variants that are officially recognized by unicode but cannot be expressed by adding a combining diacritic, such as \lvertneqq (less than but not equal to with vertical negation of only the equals sign, U+2268,U+FE00).

unicode technical report #25, unicode support for mathematics, deals with these methods in sections 2.17 and 2.18 (pages 26 ff.).
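
the two kinds of sequence can be inspected with python's standard unicodedata module. this is only an illustrative sketch: the helper name describe is made up, and the names printed come from whatever unicode database the python installation ships.

```python
import unicodedata


def describe(sequence: str) -> list[str]:
    """List each codepoint of a sequence as 'U+XXXX OFFICIAL NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in sequence]


# base relation + combining overlay (the \nvarleq case above)
print(describe("\u2264\u20d2"))
# base relation + variation selector (the \lvertneqq case above)
print(describe("\u2268\ufe00"))
```

in both cases one visible glyph corresponds to two codepoints, which is why a font's single glyph name cannot always be mapped to a single unicode number.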
