Let's run this snippet with xelatex:

\documentclass{article}
\usepackage{fontspec}
\usepackage{xeCJK}
\setCJKmainfont{Source Han Serif SC}  % SimSun is OK
\begin{document}
见
\end{document}

The document body is a single Chinese character, 见 (U+89C1). With the font Source Han Serif SC, the generated PDF contains a different character, ⻅ (U+2EC5, CJK RADICAL C-SIMPLIFIED SEE), which looks the same but is a separate code point. With SimSun, the generated PDF contains the original 见 (U+89C1).

Could anyone tell me which is to blame (fontspec, xeCJK, or Source Han Serif SC) and how I can keep the original character in all cases? Thanks.

Marijn
wdscxsj
  • if you add \showoutput which character does it show in the log? – David Carlisle Oct 12 '18 at 08:19
  • Thanks for the tip. The log shows 见 (U+89C1). – wdscxsj Oct 12 '18 at 08:36
  • So I think that means xetex and harfbuzz consider that to be the character added, which points to something in the font (I don't have the font to test). If you remove xeCJK and just use \setmainfont, you should be able to compare with lualatex and see what happens there. – David Carlisle Oct 12 '18 at 08:45
  • can you check with fontforge or similar if the Han Serif SC font has a 89C1 glyph? – David Carlisle Oct 12 '18 at 08:47
  • Wow, didn't think of that! It turns out Source Han Serif SC doesn't have this glyph. It feels quite weird because 见 is a very common Chinese character, and Source Han Serif SC a quite comprehensive font. Thanks a lot for the help! – wdscxsj Oct 12 '18 at 09:07
  • Sorry, I used a simplistic font viewer and drew that conclusion too soon. SourceHanSansSC-Regular.otf does contain 见 (U+89C1), but in an unusual way. The glyph is cid37659, which is mapped from 2 code points, U+2EC5 and U+89C1. XeLaTeX chooses the lower value by default, I suppose? Adding \XeTeXgenerateactualtext=1 fixes the issue. It's exactly the same issue as https://github.com/CTeX-org/ctex-kit/issues/286 and http://tug.org/pipermail/xetex/2017-June/027142.html. Thanks again for your guidance. – wdscxsj Oct 13 '18 at 02:05
  • Can you post that as an answer and self-accept it, thanks. Glad you sorted it out. – David Carlisle Oct 13 '18 at 07:44

1 Answer

It turns out that SourceHanSansSC-Regular.otf does contain 见 (U+89C1), but in an unusual way. The glyph is cid37659, which is mapped from 2 code points, U+2EC5 and U+89C1. The PDF stores glyphs rather than characters, so copying or searching text depends on mapping each glyph back to a code point, and XeLaTeX apparently picks the lower value, U+2EC5. Adding \XeTeXgenerateactualtext=1 fixes the issue for PDF viewers that support the "actual text" feature (Adobe Reader does; SumatraPDF doesn't).
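As a minimal sketch, this is the snippet from the question with the fix applied. Setting the parameter in the preamble is my assumption about placement (it only needs to be in effect before the text is typeset), and the installed XeTeX must be recent enough to know the primitive:

```latex
\documentclass{article}
\usepackage{fontspec}
\usepackage{xeCJK}
\setCJKmainfont{Source Han Serif SC}
% Ask XeTeX to record the original input text as ActualText in the PDF,
% so that viewers supporting the feature return U+89C1 on copy/search.
\XeTeXgenerateactualtext=1
\begin{document}
见
\end{document}
```

The displayed glyph is the same either way; only the text that a viewer extracts from the PDF should change, and only in viewers that honour ActualText.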

The same issue was reported in https://github.com/CTeX-org/ctex-kit/issues/286 (partly in Chinese) and http://tug.org/pipermail/xetex/2017-June/027142.html. It seems that a patch has been submitted but has not yet reached every distribution (my environment: MiKTeX 2.9, fully updated, on Windows 10).

Many thanks to David Carlisle for the gentle guidance.

wdscxsj