47

For example, consider the Swedish letter "Å", which is also the symbol for a unit of length in chemistry. When writing a document, there are three different ways to produce this character:

  1. Å
  2. \AA
  3. \r{A}

Given that there are three different ways of placing the character Å into a document, is there any reason to prefer one over the others? Unicode is pretty ubiquitous nowadays, so is there any reason I wouldn't want to just place the symbol directly into the document?

David Carlisle
  • 12
    They should all work in most contexts, but to typeset units, you might use a package like siunitx. – Davislor Oct 06 '21 at 20:59
  • 3
    Can you remember how to insert the unicode character? (Assuming you don't have a Swedish keyboard, of course.) Does the insertion method even work in your environment? Doesn't always in mine: after all, the point of LaTeX is that you can write in a simple text editor, which probably just understands ASCII. – jamesqf Oct 07 '21 at 16:49
  • 1
    Personally, I'm using Overleaf. On a mac, I can insert "Å" with Option+Shift+A – AmphotericLewisAcid Oct 07 '21 at 17:51
  • 2
    And people like me have a keyboard with a layer like this: https://bepo.fr/wiki/Symboles_scientifiques. – Archange Oct 07 '21 at 18:35
  • For infrequent characters I prefer to use the compose key. In many cases it is intuitive to use. On systems lacking this useful feature you can install additional software such as WinCompose. ------ @AmphotericLewisAcid I am wondering why the question in your title (why prefer Unicode) is negated compared to the question at the end of the body (why not prefer Unicode)? – pabouk - Ukraine stay strong Oct 08 '21 at 08:15
  • @pabouk: But you have the same problems (or at least I do) with compose keys as with Unicode: unless you have memorized the key combinations for a few characters you use frequently, how do you go about looking up the characters you want? While if you've used LaTeX for a while, you probably have a lookup table handy. – jamesqf Oct 08 '21 at 16:26

7 Answers

32

Do you use BibTeX to create a formatted bibliography? If so, let's assume the bibliography contains entries authored by persons with surnames Åsgard and Æsgard. You may be surprised to learn that BibTeX will sort these surnames alphabetically after Z, whereas BibTeX will sort {\AA}sgard and {\AE}sgard with the As. If the publication is in English, your readers may get very confused if you spell the authors' surnames using Å and Æ and their publications don't show up under A. (For more information on the subject of using accented characters under BibTeX and the attendant sorting issues, please see the posting How to write “ä” and other umlauts and accented letters in bibliography?)

Please note that the preceding paragraph applies to BibTeX — and to non-Unicode-aware versions of BibTeX in particular. (AFAICT, very very few BibTeX users employ bibtexu, which is said to be Unicode-aware.) If one uses biblatex+biber instead of BibTeX, the sorting issue can be adjusted on the fly to adhere to language-specific sorting standards. I.e., you can set suitable options to instruct biber to keep the entries authored by Åsgard and Æsgard with the As or, alternatively, place them after the Zs.
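As a rough sketch of the two routes described above: the file name refs.bib, the entries, and the locale value sv_SE are all hypothetical choices for illustration, not anything taken from a real bibliography. The first entry uses the {\AA} spelling that classic BibTeX needs; the second uses the Unicode character directly, which should be fine when biber does the sorting.

```latex
% Compile with: pdflatex, then biber, then pdflatex again.
\documentclass{article}
\usepackage[T1]{fontenc}

% Hypothetical bibliography, written out by the document itself.
\begin{filecontents}[overwrite]{refs.bib}
@book{asgard,
  author    = {{\AA}sgard, Anna},   % macro form: safe for classic BibTeX
  title     = {A Made-Up Title},
  publisher = {Example Press},
  year      = {2020},
}
@book{aesgard,
  author    = {Æsgard, Erik},       % direct Unicode: intended for biber
  title     = {Another Made-Up Title},
  publisher = {Example Press},
  year      = {2021},
}
@book{zeta,
  author    = {Zeta, Zoe},
  title     = {A Third Made-Up Title},
  publisher = {Example Press},
  year      = {2019},
}
\end{filecontents}

% sortlocale is a biblatex option; sv_SE is an assumed choice that asks
% biber for Swedish collation. Drop the option (or pick an English
% locale) to keep these surnames with the As instead.
\usepackage[backend=biber, sortlocale=sv_SE]{biblatex}
\addbibresource{refs.bib}

\begin{document}
\nocite{*}
\printbibliography
\end{document}
```

For the classic route, one would instead load a .bst style with \bibliographystyle and \bibliography{refs}, and keep every entry in the {\AA} form.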

Mico
  • 2
    Ah, that makes total sense! I figured there was something I wasn't considering, thanks! – AmphotericLewisAcid Oct 06 '21 at 20:51
  • 19
    I think this answer stands to be a bit misleading to many people, even though it's correct. Many people use "use BibTeX" to mean "have a .bib file and do citations automatically", whether they are using biblatex + biber or using e.g. natbib + bibtex. If you are using biblatex and biber there is no need to use {\AA}sgard. Furthermore, in the context of the rest of the document, there is also no need to use it anyway. – Alan Munn Oct 06 '21 at 20:53
  • @Mico If you have a better option to not mislead people into believing that they have to not use Unicode if they do a bibliography, then please substitute my addition with it. Else, your current answer produces exactly that effect (hell, I do use biblatex and for some seconds started wondering whether I was doing wrong with my Unicode-only approach, until I read the comments —which a lot of people won’t do, whether they know biblatex or not). – Archange Oct 07 '21 at 06:41
  • @Mico Your edit is perfect, thanks! – Archange Oct 07 '21 at 07:18
  • @AlanMunn - I've added a second paragraph to the answer, to (hopefully) make clear that the first paragraph pertains to BibTeX (the program), not to bibliography creation in general. – Mico Oct 07 '21 at 08:50
  • 11
    On the other hand, a Swedish reader would be surprised to find Å and Ä along A instead of after Z. – md2perpe Oct 07 '21 at 14:29
  • @md2perpe - What's your point? Are you contradicting or otherwise calling into question the correctness of the statement "If the publication is in English, your readers may get very confused."? If so, on what basis? – Mico Oct 07 '21 at 14:42
  • 1
    @Mico. It was a response to that statement, but it was just a comment. If the paper is in English I think that it's best to sort Å and Ä as A. That will be least confusing to a majority of the readers. Only readers from the Nordic countries will be confused. – md2perpe Oct 07 '21 at 15:22
  • 4
    Ideally, the tools doing the sorting should know (or be capable of being told) how to sort the alphabet in use, regardless of how the characters in that alphabet are being represented. {\AA} and U+00C5 (Å), for example, should be treated identically. Whether Å precedes or follows Z is not something Unicode concerns itself with. (The order of the codepoints often aligns with, but should not specify, commonly used sort orderings.) – chepner Oct 07 '21 at 15:49
  • 4
    @chepner - I mention the programmable sorting capability in the second paragraph of my answer, as one of the features of biblatex/biber. BibTeX, which is decades older than biber, has no such capability. – Mico Oct 07 '21 at 16:00
32

Assuming you are using pdflatex (not a Unicode TeX such as LuaLaTeX), then

  • Å is defined by \DeclareUnicodeCharacter{00C5}{\r A} so it expands essentially to \r{A}

  • \AA is defined by \def \AA {\r A} so it expands to the same thing as Å

  • \r{A} is defined (in T1 encoding) by \DeclareTextComposite{\r}{T1}{A}{197} so it adds the character from slot 197 in the T1 encoding, which is Å

So to LaTeX, all three are the same, although I wouldn't really describe using Å as "just place the symbol directly into the document".

Which is most convenient depends a lot on your keyboard; I can't easily type Å on this one, for example.
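A minimal test file for the equivalence described above, assuming pdflatex and a LaTeX release from 2018 or later (where UTF-8 is the default input encoding; older installations would also need \usepackage[utf8]{inputenc}). It is only a sketch for checking the claim on your own system.

```latex
% Compile with pdflatex.
\documentclass{article}
\usepackage[T1]{fontenc} % so that \r{A} resolves to a single T1 glyph

\begin{document}
% Three input forms that should give the same glyph:
Å \quad \AA \quad \r{A}

% Measure two of them to compare the typeset results.
\setbox0=\hbox{Å}
\setbox2=\hbox{\r{A}}
Widths: \the\wd0{} and \the\wd2
\end{document}
```

If the two reported widths agree, the direct character and the accent command have ended up as the same thing in the output.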

David Carlisle
16
  • Not every font includes the high Unicode glyphs, so putting one of them in a macro allows me to redefine it as necessary. You can still use the Unicode character in the macro definition.
  • Sometimes it is more convenient to create a macro and write, say, \undertie{}, than to look up the Unicode point and enter it manually, or to use a graphical character menu to select it.
  • Some Unicode characters are not clearly rendered on the console (like combining characters), so those of us who write in an editor like Vim may prefer a macro whose meaning is obvious.
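The first two points can be sketched roughly as follows, assuming LuaLaTeX or XeLaTeX with fontspec. \undertie is the macro name from the bullet above; mapping it to U+203F and falling back to \textunderscore are hypothetical choices made only for this illustration.

```latex
% Compile with lualatex or xelatex.
\documentclass{article}
\usepackage{fontspec}

% One place to change if the current font lacks the glyph.
\newcommand*{\undertie}{%
  \iffontchar\font "203F
    \symbol{"203F}% U+203F, taken from the current font
  \else
    \textunderscore % assumed stand-in when the glyph is missing
  \fi
}

\begin{document}
vowel\undertie{}vowel
\end{document}
```

The body of the document only ever mentions \undertie, so switching fonts later means editing one definition rather than every occurrence.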
musarithmia
  • 2
    Agreed that it is easier to input as a command. That is what I do as well. However, nothing stops the command definition from expanding to a Unicode character. For many writing systems, as well as European scripts, the shaping engine can also combine characters in a certain way. – yannisl Oct 06 '21 at 21:12
  • 2
    Usually the keyboard mapping used in countries with non-English symbols includes these symbols so entering them is trivial. – Jack Aidley Oct 07 '21 at 08:01
  • 1
    @JackAidley Yes, I use US International Alt Gr layout, so most characters like áéüñµç¡ are easy to type, but not the really high unicode values (music notes, prosody marks, emojis, etc.). – musarithmia Oct 07 '21 at 12:20
13

PDFTeX, unlike XeTeX or LuaTeX, cannot understand combining characters such as U+030A. So Å in decomposed form (A followed by U+030A) would fail, and you would need to use the precomposed Å (U+00C5).

This contradicts the Unicode standard, which says canonically-equivalent characters should have the same meaning, but the maintainers have said that it is a technical limitation of the 8-bit PDFTeX engine.

Not all hyphenation patterns understand combining characters in the newer engines, either. I’ve had documents hyphenate between a base character and its accents. This might have been fixed by now, at least to prevent a line break before a naked accent, but not all languages will hyphenate words with certain combining accents correctly. In LuaLaTeX or XeLaTeX, \usepackage{inputnormalization} will convert the text of the document into NFC form (that is, the precomposed standard form used on nearly all Web pages).
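Separately from the package named above, XeTeX itself has a primitive for this, \XeTeXinputnormalization. The sketch below assumes XeLaTeX only; it does nothing for LuaLaTeX, where a package-based approach would be needed instead.

```latex
% Compile with xelatex.
\documentclass{article}
\usepackage{fontspec}

% XeTeX primitive: 0 = leave input alone, 1 = normalize to NFC,
% 2 = normalize to NFD. It affects text read after this point.
\XeTeXinputnormalization=1

\begin{document}
% Intended to be typed as A followed by U+030A (decomposed);
% with NFC normalization it should be read as U+00C5.
Ångström
\end{document}
```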

Davislor
7
  • Using the unicode character will allow you to re-use the same portion of text outside of LaTeX directly. For example, you could email someone a paragraph directly from your tex file.
  • Using the LaTeX command is often more intuitive, as you are simply writing the name of the character, rather than having to remember keyboard macros for unicode (unless you are using a keyboard mapping which includes these symbols).
  • Whichever you decide to use, it is most important you remain consistent throughout your document.
6

Adding to the other responses:

To generalize the ambit of the question, and for text mode only (math mode is a different realm), the answer is, "it depends", since which input method is the most suitable to use will vary with the circumstances, even within the same document.

With a few glyphs, combined with a keyboard, or keyboard overlay, or copy-paste source, or insertion mechanism, then "direct" input is one way:

direct method

Note that the unit symbol is a different glyph codepoint than the letter.

With a few glyphs, and without a keyboard or overlay etc, "indirect" input by specifying code points can be used:

indirect method

To save typing, commonly-used characters can be assigned to suitably named macros:

named macros

And to save continually swapping keyboards in multi-character, multi-script scenarios (assuming there are keyboards or overlays defined, or definable, to start with), or to save having to type many different codepoint digits, then a transliteration method could be suitable (for example, using expl3 find-replace commands behind the scenes):

transliteration

MWE

\documentclass{article}
\usepackage[table]{xcolor}
\newcommand\thdr{\rowcolor{blue!8}}
\usepackage{fontspec}
\setmainfont{Noto Serif}
\newfontface\fcunei{Noto Sans Cuneiform}[Scale=1.3]
\newfontface\fhiero{Noto Sans Egyptian Hieroglyphs}[Scale=1.5]
\newcommand\abos{\Large\phantom{(}}
\usepackage{xparse}

\DeclareTextFontCommand\textcunei{\fcunei}
\DeclareTextFontCommand\texthiero{\fhiero}

\newcommand\ringa{Å}
\newcommand\alan{\textcunei{}}
\newcommand\seatedman{\texthiero{}}

\ExplSyntaxOn

\newcommand\doreplace[2]{
  \tl_replace_all:Nnn \l_mytemp_tl { #1 } { #2 }
}

\tl_new:N \l_mytemp_tl

\NewDocumentCommand { \tlcunei } { m } {%
  \tl_set:Nn \l_mytemp_tl { #1 }
  \doreplace{alan}{}
  \doreplace{deity}{}
  \doreplace{lady}{}
  \tl_use:N \l_mytemp_tl
}

\NewDocumentCommand { \tlhiero } { m } {%
  \tl_set:Nn \l_mytemp_tl { #1 }
  \doreplace{RA}{xxx}
  \doreplace{seatedman}{}
  \doreplace{falcon}{}
  \doreplace{ibex}{}
  \doreplace{N5}{}
  \doreplace{Z1}{}
  \doreplace{C2}{}
  \doreplace{r}{}
  \doreplace{a}{}
  \doreplace{xxx}{\RA}
  \tl_use:N \l_mytemp_tl
}

\newcommand\RA{
  \begin{tabular}{ccc}
    \begin{tabular}{c}
      \tlhiero{r}\\
      \tlhiero{a}\\
    \end{tabular}
    &
    \begin{tabular}{c}
      \tlhiero{N5}\\
      \tlhiero{Z1}\\
    \end{tabular}
    &
    \begin{tabular}{c}
      \Large\tlhiero{C2}\\
    \end{tabular}
    \\
  \end{tabular}}

\ExplSyntaxOff

\begin{document}

\begin{tabular}{cl}
\thdr Glyph & Method (``direct'') \\
\hline
Å & U+212B ANGSTROM SIGN \abos\\
Å & U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE \\
Å & A + ̊ U+030A COMBINING RING ABOVE \\
\fcunei & U+12029 CUNEIFORM SIGN ALAN \\
\fhiero & U+13000 EGYPTIAN HIEROGLYPH A001 : seated man \\
\hline
\end{tabular}

\bigskip
\begin{tabular}{cl}
\thdr Glyph & Method (``indirect'') \\
\hline
\rowcolor{yellow!15} ^^^^212b, \Uchar"212B, \symbol{"212B} & \verb|^^^^212b, \Uchar"212B, \symbol{"212B}| \abos\\
\rowcolor{green!15} ^^^^00c5, \Uchar"00C5, \symbol{"00C5} & \verb|^^^^00c5, \Uchar"00C5, \symbol{"00C5}| \\
\rowcolor{yellow!15} ^^^^0041^^^^030a, \Uchar"0041\Uchar"030A, \symbol{"0041}\symbol{"030A} & \parbox{0.4\textwidth}{\ttfamily \textasciicircum\textasciicircum\textasciicircum\textasciicircum0041\textasciicircum\textasciicircum\textasciicircum\textasciicircum030a, \textbackslash Uchar"0041\textbackslash Uchar"030A, \textbackslash symbol{"0041}\textbackslash symbol{"030A}} \\
\rowcolor{green!15} \fcunei ^^^^^^012029\Uchar"12029\symbol{"12029} & \verb|^^^^^^012029,\Uchar"12029,\symbol{"12029}| \\
\rowcolor{yellow!15} \fhiero ^^^^^^013000\Uchar"13000\symbol{"13000} & \verb|^^^^^^013000,\Uchar"13000,\symbol{"13000}| \\
\hline
\end{tabular}

\bigskip
\begin{tabular}{cl}
\thdr Glyph & Method (``named macros'') \\
\hline
\ringa & \verb|\ringa| \abos\\
\alan & \verb|\alan| \\
\seatedman & \verb|\seatedman| \\
\hline
\end{tabular}

\setlength{\tabcolsep}{0pt}
\bigskip
\begin{tabular}{cl}
\thdr Glyph & Method (``transliteration'') \\
\hline
\fcunei\tlcunei{alan deitylady} & \verb|\tlcunei{alan deitylady}| \abos\\
\fhiero\tlhiero{RA} & \verb|\tlhiero{RA}|\\
\fhiero\tlhiero{seatedman ibex falcon} & \verb|\tlhiero{seatedman ibex falcon}| \\
\hline
\end{tabular}

\end{document}

Cicada
2

If your keyboard has keys for these Unicode symbols, it may be faster and easier to type them directly as Unicode, or to copy and paste them, instead of looking up the LaTeX command. It can also make the text easier to read and edit: Jörg instead of J{\"o}rg.
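A small side-by-side sketch of the readability point, assuming pdflatex with a LaTeX release from 2018 or later (UTF-8 input by default). The sentence is made up for the example.

```latex
% Compile with pdflatex.
\documentclass{article}
\usepackage[T1]{fontenc}

\begin{document}
% Direct Unicode input: reads like the final text.
Jörg measured the bond length in Å.

% Same output, written with accent commands.
J{\"o}rg measured the bond length in \AA.
\end{document}
```

Both lines should typeset identically; the difference is only in how easy the source is to read and search.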

qwr