
I am currently using csvsimple to include some external data in a document processed with pdfLaTeX. The entire project uses UTF-8 as encoding. The original data is maintained in an Excel 2010 file, and since Excel does not support exporting to UTF-8 CSV directly, I have to go through a rather cumbersome process:

  • export to CSV
  • open with Notepad
  • save from Notepad, changing the encoding to UTF-8

This process adds a BOM (0xEF 0xBB 0xBF) to the CSV file. On my Windows desktop, this does not seem to be a problem, but the CI build on a Linux box breaks with Missing \endcsname inserted. Is there a way to tell csvsimple to ignore the BOM, or do I have to edit it out before the compilation starts?
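
The relevant part of the setup looks roughly like this (data.csv and the plain \csvautotabular call are only stand-ins for the real file and options; the actual project is larger):

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{csvsimple}
\begin{document}
% data.csv is the Excel export, re-saved by Notepad and starting with 0xEF 0xBB 0xBF
\csvautotabular{data.csv}
\end{document}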

Related: Google docs to TeX and pdf shell script produces a blank first page, and gives me a “! LaTeX Error: Missing \begin{document}.” error

vwegert
  • Several editors don't add a BOM, which seems to be characteristic of Notepad. There are utilities that change the encoding without the need to open an editor (and don't add the BOM), like Charco, but there are surely several others. I'm quite surprised that a well-known application doesn't allow exporting as UTF-8 in 2014 (29 years after its first release). – egreg Oct 31 '14 at 09:13
  • @egreg You're right, but I was equally surprised that a well-known (text) file processing program is unable to handle BOMs in the input :-) – vwegert Oct 31 '14 at 09:19
  • TeX predates UTF-8 by several years. Maybe this could be a feature request for the next release of TeX Live. – egreg Oct 31 '14 at 09:23
  • Is using an editor other than Notepad an option? Notepad++, for example, allows saving without a BOM. – Thomas F. Sturm Oct 31 '14 at 09:36
  • The BOM problem may be pdflatex-specific. I ran a test with xelatex using csvsimple and a BOM-coded input file without problems. – Thomas F. Sturm Oct 31 '14 at 09:39
  • @ThomasF.Sturm Using another editor is definitely possible and will probably be my workaround for the time being. Excel apparently throws out UTF-16 or Windows-1252, and I just need some tool to convert this to UTF-8. Switching to xelatex probably won't be an option - it's a rather large project, and from my experience, changing the compiler tends to break a lot of stuff... – vwegert Oct 31 '14 at 09:46
  • Maybe I found a solution for you (see my answer). This worked with my test file using pdflatex. Nevertheless, using another editor may be better than hacking the characters (?). – Thomas F. Sturm Oct 31 '14 at 09:51
  • Do you use Notepad or Notepad++? With Notepad++ you can configure the encoding of the file: see the Encoding menu and switch to UTF-8. – Mensch Oct 31 '14 at 09:54
  • (Extract from Byte order mark) The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. – Paul Gaborit Oct 31 '14 at 09:57

1 Answer


As you \input the file, you could try simply declaring the BOM so that it expands to nothing:

\documentclass[]{book}
\usepackage[utf8]{inputenc}
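% the BOM is U+FEFF; declaring it as empty makes inputenc drop it silently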
\DeclareUnicodeCharacter{FEFF}{}
\begin{document}
\input{test-with-bom}
\end{document}
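
The declaration maps U+FEFF, the code point behind the 0xEF 0xBB 0xBF byte sequence, to an empty definition, so the BOM simply disappears when the file content is processed. Whether the same trick helps when csvsimple reads the file depends on where the BOM tokens end up; an untested sketch of that variant, with data.csv again standing in for the real export, would be:

\documentclass{article}
\usepackage[utf8]{inputenc}
\DeclareUnicodeCharacter{FEFF}{}
\usepackage{csvsimple}
\begin{document}
\csvautotabular{data.csv}
\end{document}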
Ulrike Fischer
  • Would this answer have solved the following question?: http://tex.stackexchange.com/q/284916/90087 If yes, it would be nice to have it posted as an answer. – A Feldman Apr 16 '16 at 17:38
  • @AFeldman: No, in your case the \DeclareUnicodeCharacter would be too late. – Ulrike Fischer Apr 16 '16 at 17:41
  • Thank you, appreciated. I was thinking that it would be preferable to have an answer that did not depend on sed and an external shell script in dealing with the Google Doc BOM. – A Feldman Apr 16 '16 at 17:42