32

Unicode uses the convention of a byte order mark (BOM) as a signature at the beginning of a text stream, identifying the encoding used within it. The three bytes EF BB BF at the beginning of a file identify it as a UTF-8 file. Vi and most text editors gracefully ignore this signature. OpenOffice ignores it as well, but it also adds this signature at the beginning of the files it saves.

Now, if I open a TeX file with OpenOffice.org, and I do that quite a lot (for mixed directionality editing), the signature is added, and it baffles LaTeX, which produces an error message such as:

 ! LaTeX Error: Missing \begin{document}.

 See the LaTeX manual or LaTeX Companion for explanation.
 Type  H <return>  for immediate help.
 ...                                              

l.1 
 ��\documentclass{article}

whereas a simple dump of the file does not show the problem. Is there a way to eliminate this problem while staying in the realm of LaTeX?

Yossi Gil
  • 15,951

7 Answers

19

The inputenc package only takes effect from the point where it is loaded, so it cannot handle bytes at the very start of the document. Until then, only ASCII should be used.

For real Unicode handling, use a Unicode-capable engine like XeTeX or LuaTeX (in your case, xelatex or lualatex). Then you don't need inputenc (and will likely want to change some other packages, too).

Update: from the April 2018 LaTeX release, UTF-8 is the default encoding of LaTeX with non-Unicode engines as well. Thus this error should not happen anymore (you might get different errors for non-Unicode documents, though). See LaTeX News 28 (PDF) for details.

  • 1
    Finally, Unicode for all! Sadly, the 2018 release will not make it into the upcoming long-term release of Ubuntu, so we will be hearing about Unicode problems for at least two more years. –  Apr 24 '18 at 00:59
  • so we don't have to put " \usepackage[utf8]{inputenc} " in our preamble to use utf8 characters anymore, when using pdflatex? – user12711 Aug 09 '19 at 03:23
19

You can e.g. change the \catcode of the three problematic bytes before you input your file:

pdflatex \catcode239=9 \catcode 187=9 \catcode 191=9 \input test-bom

You should then reset the catcodes in the document to 12.

Ulrike Fischer
  • 327,261
  • @jfbu IMHO the code isn't needed anymore. All TeX systems ignore the BOM now. – Ulrike Fischer Nov 29 '17 at 10:10
  • not in my testing with TL2017 on mac os ... –  Nov 29 '17 at 10:23
  • @UlrikeFischer No, I tried and the BOM doesn't get ignored with pdftex. But it's not necessary to reset the category codes, because inputenc will do it. It seems, instead, that xetex and luatex ignore the BOM. – egreg Nov 29 '17 at 10:59
  • @egreg On windows it works fine both with miktex and texlive. But I have a faint recollection that Akira wrote that it works only there. – Ulrike Fischer Nov 29 '17 at 11:45
  • If you're willing to modify the command-line invocation, consider using Rmano's answer below which uses more "documented API" and maybe avoid having to reset the catcode inside the document. – user202729 Jun 13 '22 at 11:10
11

With the upcoming LaTeX release (2018), the BOM issue will be resolved at the kernel level, so the gymnastics attempted in the other answers (which have of course been necessary up to now) will no longer be needed, even for pdfTeX and other 8-bit TeX engines.

9

The non-Unicode engines don't know anything about Unicode or UTF-8. A similar problem would arise if you placed a BOM at the beginning of a Unix script: the kernel would then fail to recognize the shebang line.

Avoid BOMs in TeX or LaTeX documents; they are neither necessary nor recommended. The Unicode-capable engines XeTeX and LuaTeX, however, handle BOMs just fine.
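
To check whether a file actually carries the signature before handing it to a non-Unicode engine, one option outside LaTeX is to compare its first three bytes against EF BB BF. The following is a minimal Python sketch of that check; the function name has_utf8_bom is hypothetical and chosen only for illustration.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the bytes EF BB BF."""
    # Read in binary mode so that no decoding hides or alters the signature.
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8
```

Reading in binary mode matters here: a text-mode dump may silently swallow the signature, which is why the problem can stay invisible in an editor.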

Philipp
  • 17,641
  • This is indeed a challenge, but how about telling latex, through the command line, or by some other clever method, to load the package prior to loading the main file. The BOM is necessary if you want to edit mixed directionality text, since OpenOffice is the main solid editor that does that best, enabling mixed directions in individual lines. – Yossi Gil Feb 07 '11 at 13:25
  • @Yossi: See Ulrike's answer. – Philipp Feb 07 '11 at 14:00
  • 7
    You may try to do latex '\RequirePackage[utf8]{inputenc}\input{file}'. – Paŭlo Ebermann Feb 07 '11 at 14:51
  • @Yossi : for mixed directionality text see XeTeX or LuaTeX – PHL Feb 07 '11 at 17:08
  • 1
    @PaŭloEbermann Could you please add that as an answer? It is a better option than some others here and a worthy alternative to others. – cfr Dec 28 '15 at 04:43
  • @cfr I didn't put it in my answer because I didn't try it myself. Did you try it? – Paŭlo Ebermann Jan 03 '16 at 20:52
  • @PaŭloEbermann I can't remember. But there was some reason that I thought when I wrote that comment that your method would work. However, I can't now recall why. (It was to do with discussion of another question, but I can't remember anything else about that question.) – cfr Jan 03 '16 at 22:31
  • 1
    @cfr It does not work because of Unicode char (U+FEFF) not set-up for use with LaTeX. –  Nov 29 '17 at 10:21
3

The problem also occurs when you \input other files, even if you use the correct input encoding. Suppose, for example, that you have a (probably generated) file BOM-problem.tex (with a BOM in it) and you do:

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
    \input{BOM-problem.tex}
\end{document}

you will have the error

BOM-problem.tex|1 error| Package inputenc Error: Unicode char  (U+FEFF)

My solution is simply to add

\DeclareUnicodeCharacter{FEFF}{}

in the preamble --- no more errors or warnings.

It happens a lot to me with files saved as UTF-8 by LibreOffice --- so I hope it helps at least to solve a subset of the OP problem.

If the problem is that the whole LaTeX file starts with a BOM, you can combine this solution with @Paŭlo Ebermann's suggestion in the comments above and do the following:

 pdflatex -jobname=filewithBOM \
 '\RequirePackage[utf8]{inputenc}\DeclareUnicodeCharacter{FEFF}{}\input{filewithBOM.tex}' 

(all in one line, obviously) ...and maybe massaging it into some sort of script.

Rmano
  • 40,848
1

bomstrip may help here: it removes the BOM from the start of files. The LaTeX error I got was: Package inputenc Error: Unicode char (U+FEFF) not set up for use with LaTeX.
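
If such a tool is not at hand, the same effect can be approximated in a few lines of Python. This is only a sketch of the general idea, not bomstrip's actual implementation; the function name strip_utf8_bom is made up for illustration, and it assumes the file is small enough to read into memory. It rewrites the file in place, so keep a backup.

```python
import codecs


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from `path` in place.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, leave the file untouched
    # Write back everything after the three signature bytes.
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Only a BOM at the very start of the file is removed; the rest of the content is written back byte for byte.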

Max N
  • 425
1

As an alternative to Ulrike Fischer's answer, you can do this

pdflatex \\def\\foo#1#2#3{}\\expandafter\\expandafter\\expandafter\\foo\\csname @@input\\endcsname test-bom

Explanations:

  • Ulrike Fischer's answer requires the document itself to reset some catcodes (for example, » is encoded in UTF-8 as C2 BB, so resetting \catcode187 from 9 (i.e. "ignore") is really mandatory); this one leaves only a predefined \foo macro as a trace,

    edit: actually if the document uses inputenc the said catcodes will be reset by it, so a problem might arise only in some very improbable case e.g. the document does some \typeout{...»...} before actually loading inputenc. Thanks to @egreg who pointed that out in a comment to @UlrikeFischer's answer.

  • LaTeX's \input is not TeX's primitive. I wanted to do \expandafter\foo\input, but, as is known, LaTeX's \input is not expandable, so one has to use the TeX primitive, whose meaning LaTeX preserves in \@@input,

  • but @ is not a letter at this stage, and we don't want to change catcodes, (See alternative at bottom)

  • hence \csname...\endcsname but then we need to expand twice,

  • LaTeX does not have a "gobble three". It does have a "gobble four", but that again needs a \csname because @ is not a letter,

  • Finally, I needed to escape the backslashes from my shell, done here with \\.

Again the advantage of this mouthful is that it will work with any mistreated tex file which acquired a BOM. But you have to be sure it does have a BOM, naturally.

Another way, which has the advantage of leaving absolutely no trace:

pdflatex \\makeatletter\\@firstofone{\\makeatother\\expandafter\\@gobblefour\\expandafter\\x\\@@input} test-bom

The method also works for an \input of some bar-with-bom.tex from inside another TeX file.