Having problems with listings and UTF-8. Can it be fixed?

Question

I'm having some problems with listings and UTF-8 in my document. Maybe someone can help me? Some characters work, like é and ó, but á and others appear at the beginning of words...

\documentclass[12pt,a4paper]{scrbook}
\KOMAoptions{twoside=false,open=any,chapterprefix=on,parskip=full,fontsize=14pt}

\usepackage[portuguese]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{listingsutf8}

\usepackage{inconsolata}
\lstset{
    language=bash, %% Troque para PHP, C, Java, etc... bash é o padrão
    basicstyle=\ttfamily\small,
    numberstyle=\footnotesize,
    numbers=left,
    backgroundcolor=\color{gray!10},
    frame=single,
    tabsize=2,
    rulecolor=\color{black!30},
    title=\lstname,
    escapeinside={\%*}{*)},
    breaklines=true,
    breakatwhitespace=true,
    framextopmargin=2pt,
    framexbottommargin=2pt,
    extendedchars=false,
    inputencoding=utf8
}

\begin{document}
\begin{lstlisting}
<?php

echo 'Olá mundo!';
print 'Olá mundo!';
\end{lstlisting}

\end{document}

The only thing I can find is in the listings manual (page 14): "Similarly, if you are using UTF-8 extended characters in a listing, they must be placed within an escape to LaTeX." — Frédéric, Jul 31 '11 at 04:13
Also, did you have a look at the Related questions listed at the right? — Paŭlo Ebermann, Jul 31 '11 at 05:07
See also xetex - The 'listings' package and UTF-8 - TeX - LaTeX Stack Exchange for note on using UTF8 characters in listings package for Unicode-aware engines (lualatex, xelatex) — user202729, Jul 31 '22 at 05:24

score 73 · Accepted Answer · answered Jul 31 '11 at 08:13

One way to get around this limitation of listings is to set extendedchars=true and then add a literate entry for each accented character you are going to use (it's a bit tedious, but once you've covered all the accents of your language, you never have to worry about them again). The syntax is

literate={á}{{\'a}}1 {ã}{{\~a}}1 {é}{{\'e}}1

For each accented character, put the real character inside braces (e.g. {á}), then what you want it to be typeset as inside double braces (e.g. {{\'a}}), and finally the number 1, which tells listings that the replacement occupies the width of one character; you can put a space between two entries for clarity.

Here's your example modified to use this:

\documentclass[12pt,a4paper]{scrbook}
\KOMAoptions{twoside=false,open=any,chapterprefix=on,parskip=full,fontsize=14pt}

\usepackage[portuguese]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{xcolor}

\usepackage{inconsolata}
\lstset{
    language=bash, %% Troque para PHP, C, Java, etc... bash é o padrão
    basicstyle=\ttfamily\small,
    numberstyle=\footnotesize,
    numbers=left,
    backgroundcolor=\color{gray!10},
    frame=single,
    tabsize=2,
    rulecolor=\color{black!30},
    title=\lstname,
    escapeinside={\%*}{*)},
    breaklines=true,
    breakatwhitespace=true,
    framextopmargin=2pt,
    framexbottommargin=2pt,
    inputencoding=utf8,
    extendedchars=true,
    literate={á}{{\'a}}1 {ã}{{\~a}}1 {é}{{\'e}}1,
}

\begin{document}

\begin{lstlisting}
<?php

echo 'Olá mundo!';
print 'áãé';
\end{lstlisting}

\end{document}

Philippe Goutet

How would ccedil work? ç? – KramerTheCat Jul 31 '11 at 15:25
Got it! {ç}{{\c{c}}}1 {Ç}{{\c{C}}}1 – KramerTheCat Jul 31 '11 at 15:52
When I have spaces inside strings using listings, I get a strange character instead of, well, a space. Any ideas? – KramerTheCat Jul 31 '11 at 17:53
@KramerTheCat: you mean the sort of underscore instead of spaces? You can turn it off by adding showstringspaces=false to your \lstset. – Philippe Goutet Jul 31 '11 at 18:52
It worked for me as soon as I noticed that my custom package, in which I defined the options of the listings package, was not encoded in UTF-8. A silly mistake, but a time-consuming one. – M. Toya Apr 12 '12 at 16:38
A serious bug in one piece of LaTeX I just looked at was that extendedchars was set to \true, not true, so even though I did not need that setting in the end, your first line helped a lot to pinpoint the problem. – Anaphory Jul 13 '14 at 19:25
The Wikibook on LaTeX has a ready-made list of characters and their escaped versions that will “cover most characters in latin languages” that you can copy into your document instead of writing the entire thing yourself. – doncherry Sep 11 '16 at 19:27
The escaped characters for Czech can be found in one of the answers to this post: https://tex.stackexchange.com/questions/30512/how-to-insert-code-with-accents-with-listings – Jan 25 '19 at 20:10
When using the option showstringspaces=false in combination with your solution, spaces are no longer displayed after any special character. Do you have an idea how to fix that? – Manuel Selva Aug 28 '20 at 20:25
@ManuelSelva: I'm not experiencing your problem. Did you test with the code above or in a document of yours with other packages/options? If the problem persists, don't hesitate to ask a new question. – Philippe Goutet Aug 29 '20 at 08:53
@PhilippeGoutet thank you very much. While building a MWE, I found the problem. I had a columns=flexible setting that I just removed to fix the issue. – Manuel Selva Aug 29 '20 at 11:46
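
The suggestions from the comments above can be folded into the accepted answer's settings. The following is a minimal sketch, assuming pdflatex and a UTF-8 encoded source file; the document class, the choice of language=PHP and the sample string are illustrative assumptions rather than part of the original answer.

```latex
% Compile with pdflatex; this file itself must be saved as UTF-8.
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{listings}

\lstset{
    language=PHP,
    basicstyle=\ttfamily\small,
    inputencoding=utf8,
    extendedchars=true,
    showstringspaces=false, % print ordinary spaces inside string literals
    % One entry per accented character: {input}{{output}}width
    literate={á}{{\'a}}1 {ã}{{\~a}}1 {é}{{\'e}}1
             {ç}{{\c{c}}}1 {Ç}{{\c{C}}}1,
}

\begin{document}
\begin{lstlisting}
<?php
echo 'Olá, ação!';
\end{lstlisting}
\end{document}
```

Any accented character that appears in a listing but has no literate entry is still affected by the original problem, so the list has to cover every character actually used.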

score 31 · Answer 2 · edited Jul 13 '14 at 19:14

Escape those characters to LaTeX, as the documentation (listings manual, page 14) suggests:

Similarly, if you are using UTF-8 extended characters in a listing, they must be placed within an escape to LaTeX.

\documentclass[12pt,a4paper]{scrbook}
\KOMAoptions{twoside=false,open=any,chapterprefix=on,parskip=full,fontsize=14pt}

\usepackage[portuguese]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{listingsutf8}
\usepackage{xcolor}

\usepackage{inconsolata}
\lstset{
    language=bash, %% Troque para PHP, C, Java, etc... bash é o padrão
    basicstyle=\ttfamily\small,
    numberstyle=\footnotesize,
    numbers=left,
    backgroundcolor=\color{gray!10},
    frame=single,
    tabsize=2,
    rulecolor=\color{black!30},
    title=\lstname,
    escapeinside={\%*}{*)},
    breaklines=true,
    breakatwhitespace=true,
    framextopmargin=2pt,
    framexbottommargin=2pt,
    extendedchars=false,
    inputencoding=utf8
}

\begin{document}
\begin{lstlisting}
<?php

echo '%*Olá mundo*)!';
print '%*Olá mundo*)!';
\end{lstlisting}

\end{document}

This works better than the accepted answer. I started to use escapeinside={\%(*}{*)} so that the parentheses are matched in pair. — Hai Lang, Apr 27 '19 at 03:35
This approach is not better than the accepted one, as it affects the surrounding formatting. For example, if your comments are set in italics and you use this approach, the escaped text comes out in the regular font instead. — Alberto, Dec 07 '20 at 19:22

Paŭlo Ebermann · Answer 3 · 2011-07-31T14:32:28.363

The way the inputenc package works with non-ASCII UTF-8-encoded characters (by making the first byte active and then reading the following ones as arguments) is fundamentally incompatible with the way the listings package works, which reads each byte individually and expects it to be an individual character.

The listingsutf8 package tries to work around this for the case where your characters are convertible to some 8-bit encoding (and you are using pdfLaTeX), but this works only with \lstinputlisting (as Marc's answer pointed out), not with listings typed inline in the document. For inline listings, the literate option (as pointed out by Philippe) sounds good. An alternative would be escaping to LaTeX (as pointed out by Gonzalo), but this means that simple cut-and-paste no longer works.

The last time I had to typeset code that included non-ASCII Unicode characters (such as ℤ in Java identifiers, which are not in any 8-bit encoding, AFAIK), I switched to XeLaTeX, which supports UTF-8 input out of the box without needing the inputenc package. With this, it worked nicely. I suppose LuaLaTeX would work the same way (but it was not that mature then).

(But I later wanted the comments to be formatted, too, thus I started/revived my ltxdoclet project to include source code and formatted comments.)
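
For readers who want to try the Unicode-engine route, here is a minimal, untested sketch. It assumes compilation with xelatex or lualatex, the fontspec package with its default fonts, and a listing that only uses characters from the Latin-1 range (such as á); characters beyond that range may need extra handling, as discussed in the linked question on Unicode-aware engines.

```latex
% Compile with xelatex or lualatex (not pdflatex).
% inputenc and fontenc are deliberately not loaded here.
\documentclass{article}
\usepackage{fontspec} % Unicode font support for XeLaTeX/LuaLaTeX
\usepackage{listings}

\lstset{
    basicstyle=\ttfamily\small,
    extendedchars=true, % accept characters 128-255 in listings
}

\begin{document}
\begin{lstlisting}
<?php
echo 'Olá mundo!';
\end{lstlisting}
\end{document}
```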

score 11 · Answer 4 · edited May 18 '23 at 17:54

To help others, here is a fairly complete literate setting for use with the listings package:

\lstset{
    inputencoding = utf8,  % Input encoding
    extendedchars = true,  % Extended ASCII
    literate      =        % Support additional characters
      {á}{{\'a}}1  {é}{{\'e}}1  {í}{{\'i}}1 {ó}{{\'o}}1  {ú}{{\'u}}1
      {Á}{{\'A}}1  {É}{{\'E}}1  {Í}{{\'I}}1 {Ó}{{\'O}}1  {Ú}{{\'U}}1
      {à}{{\`a}}1  {è}{{\`e}}1  {ì}{{\`i}}1 {ò}{{\`o}}1  {ù}{{\`u}}1
      {À}{{\`A}}1  {È}{{\`E}}1  {Ì}{{\`I}}1 {Ò}{{\`O}}1  {Ù}{{\`U}}1
      {ä}{{\"a}}1  {ë}{{\"e}}1  {ï}{{\"i}}1 {ö}{{\"o}}1  {ü}{{\"u}}1
      {Ä}{{\"A}}1  {Ë}{{\"E}}1  {Ï}{{\"I}}1 {Ö}{{\"O}}1  {Ü}{{\"U}}1
      {â}{{\^a}}1  {ê}{{\^e}}1  {î}{{\^i}}1 {ô}{{\^o}}1  {û}{{\^u}}1
      {Â}{{\^A}}1  {Ê}{{\^E}}1  {Î}{{\^I}}1 {Ô}{{\^O}}1  {Û}{{\^U}}1
      {œ}{{\oe}}1  {Œ}{{\OE}}1  {æ}{{\ae}}1 {Æ}{{\AE}}1  {ß}{{\ss}}1
      {ẞ}{{\SS}}1  {ç}{{\c{c}}}1 {Ç}{{\c{C}}}1 {ø}{{\o}}1  {Ø}{{\O}}1
      {å}{{\aa}}1  {Å}{{\AA}}1  {ã}{{\~a}}1  {õ}{{\~o}}1 {Ã}{{\~A}}1
      {Õ}{{\~O}}1  {ñ}{{\~n}}1  {Ñ}{{\~N}}1  {¿}{{?`}}1  {¡}{{!`}}1
      {°}{{\textdegree}}1 {º}{{\textordmasculine}}1 {ª}{{\textordfeminine}}1
      {£}{{\pounds}}1  {©}{{\copyright}}1  {®}{{\textregistered}}1
      {«}{{\guillemotleft}}1  {»}{{\guillemotright}}1  {Ð}{{\DH}}1  {ð}{{\dh}}1
      {Ý}{{\'Y}}1    {ý}{{\'y}}1    {Þ}{{\TH}}1    {þ}{{\th}}1    {Ă}{{\u{A}}}1
      {ă}{{\u{a}}}1  {Ą}{{\k{A}}}1  {ą}{{\k{a}}}1  {Ć}{{\'C}}1    {ć}{{\'c}}1
      {Č}{{\v{C}}}1  {č}{{\v{c}}}1  {Ď}{{\v{D}}}1  {ď}{{\v{d}}}1  {Đ}{{\DJ}}1
      {đ}{{\dj}}1    {Ė}{{\.{E}}}1  {ė}{{\.{e}}}1  {Ę}{{\k{E}}}1  {ę}{{\k{e}}}1
      {Ě}{{\v{E}}}1  {ě}{{\v{e}}}1  {Ğ}{{\u{G}}}1  {ğ}{{\u{g}}}1  {Ĩ}{{\~I}}1
      {ĩ}{{\~\i}}1   {Į}{{\k{I}}}1  {į}{{\k{i}}}1  {İ}{{\.{I}}}1  {ı}{{\i}}1
      {Ĺ}{{\'L}}1    {ĺ}{{\'l}}1    {Ľ}{{\v{L}}}1  {ľ}{{\v{l}}}1  {Ł}{{\L{}}}1
      {ł}{{\l{}}}1   {Ń}{{\'N}}1    {ń}{{\'n}}1    {Ň}{{\v{N}}}1  {ň}{{\v{n}}}1
      {Ő}{{\H{O}}}1  {ő}{{\H{o}}}1  {Ŕ}{{\'{R}}}1  {ŕ}{{\'{r}}}1  {Ř}{{\v{R}}}1
      {ř}{{\v{r}}}1  {Ś}{{\'S}}1    {ś}{{\'s}}1    {Ş}{{\c{S}}}1  {ş}{{\c{s}}}1
      {Š}{{\v{S}}}1  {š}{{\v{s}}}1  {Ť}{{\v{T}}}1  {ť}{{\v{t}}}1  {Ũ}{{\~U}}1
      {ũ}{{\~u}}1    {Ū}{{\={U}}}1  {ū}{{\={u}}}1  {Ů}{{\r{U}}}1  {ů}{{\r{u}}}1
      {Ű}{{\H{U}}}1  {ű}{{\H{u}}}1  {Ų}{{\k{U}}}1  {ų}{{\k{u}}}1  {Ź}{{\'Z}}1
      {ź}{{\'z}}1    {Ż}{{\.Z}}1    {ż}{{\.z}}1    {Ž}{{\v{Z}}}1
      % ¿ and ¡ are not correctly displayed if inconsolata font is used
      % together with the lstlisting environment. Consider typing code in
      % external files and using \lstinputlisting to display them instead.      
  }

Please feel free to edit this list with more/missing characters!

Additional characters for Vietnamese: https://stackoverflow.com/a/29197383

Additional characters for Greek: https://stackoverflow.com/a/33153163
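
A long literate table like the one above can also be kept in a named style and switched on only for the listings that need it. This is a sketch under stated assumptions: the style name accents is hypothetical, the table is shortened to four entries for brevity, and \lstdefinestyle is the standard listings command for defining styles.

```latex
% Compile with pdflatex; this file itself must be saved as UTF-8.
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{listings}

% "accents" is a hypothetical style name; paste the full table here.
\lstdefinestyle{accents}{
    inputencoding=utf8,
    extendedchars=true,
    literate={á}{{\'a}}1 {ã}{{\~a}}1 {é}{{\'e}}1 {ç}{{\c{c}}}1,
}

\begin{document}
% The style is applied per listing through the optional argument.
\begin{lstlisting}[style=accents, basicstyle=\ttfamily\small]
echo 'Olá mundo!';
\end{lstlisting}
\end{document}
```

If the style lives in a separate package or input file, that file has to be saved as UTF-8 as well, which is the pitfall M. Toya describes in the comments on the accepted answer.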

This is great. Thank you! — stackoverflowuser2010, May 24 '21 at 04:49

score 7 · Answer 5 · answered Jul 31 '11 at 07:15

With the listingsutf8 package and a traditional (not UTF-8) TeX engine, you have to use the \lstinputlisting command only, which properly displays a UTF-8 encoded file. You can't use the lstlisting environment, unless the code inside is plain ASCII.
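
A minimal sketch of that setup, assuming pdflatex, a UTF-8 encoded file next to the document (the name ola.php is hypothetical), and source code whose characters all exist in Latin-1. The inputencoding=utf8/latin1 value is the form described in the listingsutf8 documentation.

```latex
% Compile with pdflatex. Assumes a UTF-8 encoded file ola.php
% (hypothetical name) in the same directory as this document.
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{listingsutf8}

\lstset{
    basicstyle=\ttfamily\small,
    extendedchars=true,
    % Read the file as UTF-8 and convert it to Latin-1 for listings.
    inputencoding=utf8/latin1,
}

\begin{document}
\lstinputlisting{ola.php}
\end{document}
```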

Marc Baudoin

score 1 · Answer 6 · answered Dec 06 '20 at 21:30

This is a modified version that adds support for Swedish and German characters (åäö üß) in addition to the Portuguese ones.

Put the following line in the preamble, next to listings and xcolor, which the settings below rely on:

\usepackage{inconsolata} % monospaced font used for the listings

and then, where you want the code listing, put the code below.

\lstset{
  language=bash, % Switch code language ... bash is the default
  basicstyle=\ttfamily\footnotesize,
  numberstyle=\tiny,
  numbers=left,
  backgroundcolor=\color{gray!10},
  frame=single,
  tabsize=2,
  rulecolor=\color{black!30},
  title=\lstname,
  escapeinside={\%*}{*)},
  breaklines=true,
  breakatwhitespace=true,
  framextopmargin=2pt,
  framexbottommargin=2pt,
  inputencoding=utf8,
  extendedchars=true,
  % Support for Swedish, German and Portuguese accented characters
  literate=%
  {Ö}{{\"O}}1
  {Ä}{{\"A}}1
  {Å}{{\AA{}}}1
  {Ü}{{\"U}}1
  {ß}{{\ss}}1
  {ü}{{\"u}}1
  {ö}{{\"o}}1
  {ä}{{\"a}}1
  {å}{{\aa{}}}1
  {á}{{\'a}}1
  {ã}{{\~a}}1
  {é}{{\'e}}1,
}
\lstinputlisting[language=bash]{your_code_file.txt}

score 0 · Answer 7 · answered Aug 01 '22 at 15:14

This answer (and the other answers on this page) apply to the pdflatex engine only. For other engines, see the linked question above.

This method is not really recommended for readers who are not TeX experts. In other words, if you don't understand what this solution does, use one of the other solutions.

Recall that in pdflatex, the purpose of the inputenc package with the utf8 option is to provide a mapping such as á → \'a, so providing that mapping again by hand is rather redundant.

Instead, the literate table can reuse the mapping that inputenc already defines.

The obvious approach, specifying a value such as {á}{{á}}1, does not work because the listings package redefines the active characters in the range 128-255 for its own purposes.

Instead, we can access the internal place where inputenc stores the definition: the control sequence named \u8:⟨the character in UTF-8 encoding⟩. (Note that this is an internal implementation detail of the package and as such is subject to change; nevertheless, this particular part of the code is rather unlikely to change.)

Example

\documentclass[12pt,a4paper]{scrbook}
\KOMAoptions{twoside=false,open=any,chapterprefix=on,parskip=full,fontsize=14pt}
\usepackage[portuguese]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{xcolor}
\lstset{
    language=bash, %% Troque para PHP, C, Java, etc... bash é o padrão
    basicstyle=\ttfamily\small,
    numberstyle=\footnotesize,
    numbers=left,
    backgroundcolor=\color{gray!10},
    frame=single,
    tabsize=2,
    rulecolor=\color{black!30},
    title=\lstname,
    escapeinside={\%*}{*)},
    breaklines=true,
    breakatwhitespace=true,
    framextopmargin=2pt,
    framexbottommargin=2pt,
    inputencoding=utf8,
    extendedchars=true,
    literate=
    {á}{{\csname u8:\detokenize{á}\endcsname}}1
    {ã}{{\csname u8:\detokenize{ã}\endcsname}}1
    {é}{{\csname u8:\detokenize{é}\endcsname}}1
    ,
}
\begin{document}
\begin{lstlisting}
<?php
echo 'Olá mundo!';
print 'áãé';
\end{lstlisting}
\end{document}

I don't think the algorithm that the listings package uses is efficient enough to handle every Unicode character defined by LaTeX, so stick to defining only the characters you actually use. (Pre-parsing the whole file to see which Unicode characters are used is also an option.)

One disadvantage of this method is that it will silently ignore Unicode characters that LaTeX does not define.
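
One possible way to make that failure visible is to test whether the internal macro exists before using it. The following is an untested sketch, assuming pdflatex with e-TeX's \ifcsname available; \lstutf is a hypothetical helper name and is not part of listings or inputenc.

```latex
% Compile with pdflatex; this file itself must be saved as UTF-8.
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{listings}

% \lstutf is a hypothetical helper, not part of any package.
% It uses inputenc's internal \u8:<bytes> macro when it exists and
% prints a question mark when LaTeX has no definition for the character.
\newcommand*\lstutf[1]{%
  \ifcsname u8:\detokenize{#1}\endcsname
    \csname u8:\detokenize{#1}\endcsname
  \else
    ?%
  \fi
}

\lstset{
    basicstyle=\ttfamily\small,
    inputencoding=utf8,
    extendedchars=true,
    literate=
    {á}{{\lstutf{á}}}1
    {ã}{{\lstutf{ã}}}1
    {é}{{\lstutf{é}}}1
    ,
}

\begin{document}
\begin{lstlisting}
echo 'Olá mundo!';
print 'áãé';
\end{lstlisting}
\end{document}
```

The placeholder only helps for characters that have a literate entry; a character with no entry at all is still subject to the original problem.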
