After switching to OS X, one of the first things I had to learn the hard way was that many non-ASCII characters, such as the German ü, can be encoded in (at least) two different forms in UTF-8:
- U+00FC (LATIN SMALL LETTER U WITH DIAERESIS): Normalization Form C (NFC)
- U+0075 U+0308 (LATIN SMALL LETTER U followed by COMBINING DIAERESIS): Normalization Form D (NFD)
(The gory details are all described here.)
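To make the difference between the two forms concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `describe` is made up for this illustration; it only prints the code points and the UTF-8 bytes of each form.

```python
import unicodedata

# The same visible character "ü" in both normalization forms.
nfc = unicodedata.normalize("NFC", "\u00fc")  # one precomposed code point
nfd = unicodedata.normalize("NFD", "\u00fc")  # base letter + combining mark


def describe(text):
    """Return (code points, UTF-8 bytes as hex) for a string."""
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    utf8_hex = text.encode("utf-8").hex()
    return code_points, utf8_hex


print(describe(nfc))  # ('U+00FC', 'c3bc')
print(describe(nfd))  # ('U+0075 U+0308', '75cc88')
print(nfc == nfd)     # False: the strings look the same but differ
```

The byte sequences are the same ones that show up in the `xxd` output further down.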
Basically, all operating systems and applications today use NFC only, with the exception of Mac OS X, where some components (e.g., OpenOffice or the HFS+ file system) use NFD. The result is that if you copy & paste some text from such a source (e.g., the output of the ls command) into your LaTeX document, everything looks fine at first glance:
\documentclass{article}
\usepackage[utf8]{inputenc} % comment out for lualatex/xelatex
\usepackage[T1]{fontenc} % comment out for lualatex/xelatex
\begin{document}
äöüÄÖÜß
\end{document}
However, when compiling with pdflatex:
! Package inputenc Error: Unicode char \u8:̈ not set up for use with LaTeX.
An often-given answer to Unicode problems is "use lualatex/xelatex". However, that does not seem to help here either: when compiling with lualatex/xelatex, the output does not contain the umlauts.
Question: The inputenc package with [utf8] is apparently not able to handle NFD. Is it possible to extend it so that the above does compile?
WARNING
Note that the MWE, if copied & pasted from here into a new document, actually does compile. Apparently either my browser or the SE site transparently transforms NFD to NFC. (For Safari and Chrome that indeed seems to be the case; I have also tried Firefox, without success.) I have yet to figure out how to provide some piece of text in NFD here.
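One possible way around the copy & paste problem is to generate the NFD version of the MWE locally instead of pasting it. The following is only a sketch under that assumption: the function names `mwe_source` and `write_mwe` and the file name used below are made up for this example, and the umlauts are written as escapes so the script itself does not depend on how it was copied.

```python
import unicodedata

# äöüÄÖÜß, written as escapes so the form of this script's own text does not matter
UMLAUTS = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df"


def mwe_source(form="NFD"):
    """Return the MWE as a string, with the umlaut line in the given form."""
    lines = [
        r"\documentclass{article}",
        r"\usepackage[utf8]{inputenc} % comment out for lualatex/xelatex",
        r"\usepackage[T1]{fontenc} % comment out for lualatex/xelatex",
        r"\begin{document}",
        unicodedata.normalize(form, UMLAUTS),
        r"\end{document}",
    ]
    return "\n".join(lines) + "\n"


def write_mwe(path, form="NFD"):
    """Write the MWE to `path` as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(mwe_source(form))
```

Calling `write_mwe("nfd-mwe.tex")` should give a file that reproduces the problem; `write_mwe("nfc-mwe.tex", form="NFC")` gives the variant that compiles.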
Excursus: A Bit of Extra Background on HFS+
I first stumbled over this issue when trying to put the output of an ls command into my LaTeX document. The source of many, many problems in OS X is that the HFS+ file system uses (for some totally weird reasons) NFD. Even worse: HFS+ transparently transforms all NFC characters it gets as input into NFD internally. Practically, this means that the filenames you get out are different from those you put in: if you create a file ü (the keyboard delivers NFC) and then list the directory (the file system delivers NFD), the name looks the same, but is in fact different. A short illustration (executed in an empty directory):
$ echo ü; echo ü | xxd; touch ü; ls; ls | xxd
ü
0000000: c3bc 0a ...
ü
0000000: 75cc 880a u...
This is the reason why so many tools (unison, svn, git, ...) and bash's tab completion choke on filenames containing umlauts on OS X – and why you cannot use the output of ls directly in your LaTeX document.
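As a stopgap for the ls case, the file names can be normalized to NFC before they are pasted into the document. This is a minimal sketch and not a fix for inputenc itself; `to_nfc` and `nfc_listing` are hypothetical helper names chosen for this example.

```python
import os
import unicodedata


def to_nfc(names):
    """Normalize each name to NFC and return the names sorted."""
    return sorted(unicodedata.normalize("NFC", name) for name in names)


def nfc_listing(directory="."):
    """Return the entries of `directory` with their names normalized to NFC."""
    return to_nfc(os.listdir(directory))


if __name__ == "__main__":
    # Print an NFC version of what `ls` would show for the current directory.
    for name in nfc_listing():
        print(name)
```

Sorting happens after normalization, so the order does not depend on which form the file system delivers.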



… ls. It's really annoying. – You Jan 18 '13 at 23:18
… ls into my LaTeX document. They are probably also the instance on which OS X users sooner or later will hit the root of the problem. However, rephrased the excursus to make this more clear. – Daniel Jan 19 '13 at 11:58
… \XeTeXinputnormalization=1 to normalize input into NFC. – خالد حسني Jan 19 '13 at 14:38