
I use LaTeX (specifically, pdflatex from MiKTeX) to automatically generate documents based on data gathered from the Internet. This means the documents can contain small passages in other languages such as Chinese, Korean, Arabic, Turkish, ...

Therefore I don't know in advance which languages are used, or where they appear, within my document.

Until now I have replaced all characters from foreign scripts with a dot to avoid problems, but then, for example, company names end up as ..... instead of their Chinese characters.

Is it possible to build one LaTeX document that supports all available languages, or at least a large set of them?

If it is possible, how do I know which Unicode characters are supported in such a document and which are not, so I can filter them out before generating the .tex sources?

JMax
  • with pdflatex this is very hard. With lualatex you can cover a rather large part of Unicode. – Ulrike Fischer Apr 30 '21 at 12:53
  • Indeed using LuaLaTeX or XeLaTeX you can display almost all Unicode text. The most important constraint is that you need a font that has coverage for all the different scripts. You can also combine fonts such that different scripts are displayed with different fonts automatically; see for example https://tex.stackexchange.com/questions/514940/define-fallback-font-for-missing-glyphs-in-lualatex. – Marijn Apr 30 '21 at 12:55
  • "based on data gathered from the Internet" - in case of problems, it may be a good idea to check the encoding of the original data - not all websites use unicode (and unicode can be UTF-8, UTF16BE, UTF16LE...). So the first step must be: Make one encoding (e.g. UTF-8). – knut Apr 30 '21 at 13:24
  • @knut Yes, correct encoding is a requirement. I did not mention it, but the data does not come from web pages; it comes from a JSON API and is therefore already UTF-8 encoded. – JMax Apr 30 '21 at 13:40
  • @UlrikeFischer Is LuaTeX still so slow? My documents can get a little large and there are a lot of them to generate. pdflatex is already not very fast (up to 20 minutes for a large document on a 3 year old Xeon server). – JMax Apr 30 '21 at 13:44
  • @Marijn the new fallback key in luaotfload is better for setting up fonts which cover a large range of scripts; see the luaotfload documentation or e.g. https://tex.stackexchange.com/a/572220/2388 – Ulrike Fischer Apr 30 '21 at 13:48
  • @JMax for short documents lualatex is typically a bit slower, but for large documents it depends a lot on the content. You will have to try it out. – Ulrike Fischer Apr 30 '21 at 13:51
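
A minimal sketch of the fallback approach mentioned in the last two comments, assuming LuaLaTeX (TeX Live 2020 or newer for mode=harf) and that DejaVu Sans and Noto Sans CJK SC are installed; the font names and the fallback name are only illustrative:

% Compile with lualatex
\documentclass{article}
\usepackage{fontspec}

% Declare a fallback chain: any glyph missing from the main font is
% taken from the first font in the chain that provides it.
\directlua{
  luaotfload.add_fallback("scriptfallback", {
    "DejaVuSans:mode=harf;",    % broad general coverage
    "NotoSansCJKsc:mode=harf;", % CJK coverage
  })
}

\setmainfont{Latin Modern Roman}[RawFeature={fallback=scriptfallback}]

\begin{document}
Hello 你好 -- the CJK glyphs come from the fallback chain.
\end{document}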

2 Answers


Here is an example of a document that handles several different scripts automatically (a minimal sketch appears below).

It’s not always possible to detect which language is using a given script: for example, whether you’re processing Arabic versus Persian, or Spanish versus French. Unfortunately, there are a few languages that write the same Unicode codepoints differently, such as Japanese kanji and traditional Chinese, and you cannot display those correctly without language tagging.

The simplest solution is to select a font that supports a large number of scripts, such as FreeSerif or DejaVu Sans. No OpenType font can support all of Unicode, but you probably only care about languages that are still spoken today.
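
A minimal sketch of that single-font approach, assuming LuaLaTeX or XeLaTeX with fontspec and FreeSerif installed (the sample text is only illustrative):

% Compile with lualatex or xelatex
\documentclass{article}
\usepackage{fontspec}

% One font with broad Unicode coverage handles many scripts at once.
\setmainfont{FreeSerif}

\begin{document}
Latin, кириллица and ελληνικά in one paragraph, with no per-language
markup, as long as the font provides the glyphs.
\end{document}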

Davislor

With XeLaTeX and polyglossia one can make multilanguage PDFs. For Asian languages one also needs ucharclasses.

% !TEX TS-program = xelatex
% !TEX encoding = UTF-8 Unicode
\documentclass[12pt, a4paper]{article}

\usepackage{polyglossia}

\setmainlanguage[variant=british]{english}
\setotherlanguages{hebrew, greek, japanese}

\newfontfamily\hebrewfont{SBL Hebrew}
\newfontfamily\greekfont{SBL Greek}
\newfontfamily{\cjkfont}{WenQuanYi Zen Hei}

\usepackage[CJK]{ucharclasses}
\setDefaultTransitions{\defaultfont}{}
\setTransitionsForCJK{\cjkfont}{}

\title{Title}

\begin{document}

\section{First}

\textgreek{αταραξία}. That was in Greek using SBL Greek.

\texthebrew{קֹהֶלֶת}. That was in Hebrew using SBL Hebrew.

\textjapanese{東南西北}. That was in CJK using WenQuanYi.

\end{document}

Oni
  • But that still requires scanning the text I have and, based on the Unicode characters, building those language-specific sections. It is not as simple as pasting UTF-8 characters into a document like it is in HTML :( – JMax Apr 30 '21 at 13:24
  • You can do that, but then you would probably get the main font for those characters, apart from Japanese. – Oni Apr 30 '21 at 13:49
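
As a sketch of what that comment suggests: pasted Hebrew could also be made to switch fonts automatically by extending the preamble from the answer, assuming ucharclasses accepts the Hebrew Unicode block both as a package option and in \setTransitionsFor (untested):

\usepackage[CJK, Hebrew]{ucharclasses}
\setDefaultTransitions{\defaultfont}{}
\setTransitionsForCJK{\cjkfont}{}
% Switch to the Hebrew font whenever characters from the Hebrew
% Unicode block appear, with no \texthebrew markup needed.
\setTransitionsFor{Hebrew}{\hebrewfont}{}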