
Problem Description

As in the title.

MWE:

\documentclass{article}
\begin{document}

% first write some file content for demo...
\begin{filecontents}{test test.txt}
line 1 !?#\xyz
line 2
\end{filecontents}

% what to do here? I want to read the whole content of the file into variable \result

\ExplSyntaxOn
\str_show:N \result % → the whole content of the file.
\ExplSyntaxOff

\end{document}

itc
user202729
  • I found several very-similar questions (see links below), but none that is exactly identical to this one. – user202729 Jun 10 '22 at 13:25
  • Side note, if the file is huge, you may want to refrain from using this method. – user202729 Jun 10 '22 at 13:26
  • TeX doesn't work with strings, so "string variable" is nonsense from TeX's point of view. There are only token lists in TeX. – wipet Jun 10 '22 at 17:27
  • @wipet The question title contains “(expl3) string variable”. String variables are perfectly well defined in expl3. The question didn't mention that computational complexity was a concern, but its “string variable” part, at least, is clear. In case it is not for you, see “The l3str package: Strings” in interface3.pdf. – frougon Jun 10 '22 at 19:03
  • @frougon Because expl3 is based on TeX and TeX doesn't work with strings, something like "string variable" is only mystification and denial of TeX principles. This can bring only misunderstanding. Moreover \string primitive is something different, so misunderstanding is very probable. – wipet Jun 10 '22 at 19:26
  • @wipet There is no catcode ambiguity in an expl3 string, and only character tokens. I bet you didn't read the first 8 lines of the chapter I pointed you to. Which means, I am wasting my time. – frougon Jun 10 '22 at 19:30

2 Answers


One option is to use \file_get:nnN. There are other options as well.

%! TEX program = pdflatex
\documentclass{article}
\begin{document}

\begin{filecontents}{test test.txt}
line 1 !?#\xyz
line 2
\end{filecontents}

\ExplSyntaxOn

\file_get:nnN {test~test.txt} {
    % alternatively (although the effects are different on LuaTeX)
    % \int_step_inline:nnn {0} {255} {\char_set_catcode_other:n{#1}}
    \cctab_select:N \c_other_cctab
    \endlinechar=10~
    % ↑ this must be done after the cctab line
    % because cctab changes the value of endlinechar as well
} \result
\str_set:NV \result \result

\str_show:N \result % → the whole content of the file.
\ExplSyntaxOff

\end{document}

This saves the whole content of the file into the variable \result. Each "new line character" in the file is represented by a character with char code 10.
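As a usage sketch (not part of the original answer), the char-10 separators make it possible to split \result back into lines. This reuses the code above unchanged; the only additions are \seq_set_split:NnV with a separator built by \char_generate:nn and the scratch variable \l_tmpa_seq. Note that \seq_set_split trims surrounding spaces from each item, and since the last line also ends with a character 10, a final empty item is to be expected.

```latex
\documentclass{article}
\begin{document}

\begin{filecontents}{test test.txt}
line 1 !?#\xyz
line 2
\end{filecontents}

\ExplSyntaxOn
% Read the file exactly as in the answer above.
\file_get:nnN {test~test.txt} {
    \cctab_select:N \c_other_cctab
    \endlinechar=10~
} \result
\str_set:NV \result \result

% Split on the token with char code 10 and catcode 12.
% \exp_args:NNx expands the separator argument before the split runs.
\exp_args:NNx \seq_set_split:NnV \l_tmpa_seq
  { \char_generate:nn { 10 } { 12 } } \result

% Write each item to the terminal, bracketed so empty items are visible.
\seq_map_inline:Nn \l_tmpa_seq { \iow_term:n { [#1] } }
\ExplSyntaxOff

\end{document}
```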

Explanation

\file_get:nnN {test~test.txt} {

Use the command, as explained. Note that since this is an expl3 environment, the space in the file name needs to be written as ~.
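To see why the ~ is needed, here is a minimal sketch using the standard scratch variables \l_tmpa_str and \l_tmpb_str: inside \ExplSyntaxOn, literal spaces are ignored, while ~ produces an ordinary space.

```latex
\documentclass{article}
\begin{document}
\ExplSyntaxOn
\str_set:Nn \l_tmpa_str { test test.txt } % literal spaces are ignored here
\str_set:Nn \l_tmpb_str { test~test.txt } % ~ yields a real space
\iow_term:x { [ \str_use:N \l_tmpa_str ] } % writes [testtest.txt]
\iow_term:x { [ \str_use:N \l_tmpb_str ] } % writes [test test.txt]
\ExplSyntaxOff
\end{document}
```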

    \cctab_select:N \c_other_cctab

Set the category codes so that characters are read with catcode 12 (other). On XeTeX it's impractical to set the catcode of every character in 0..1114111, so there are corner cases where a character in the file has catcode 1 or 2, which makes the content unbalanced and thus causes an error.

Note that, in addition to selecting the catcode table as its name suggests, this command also sets the endlinechar (in this catcode table, endlinechar = -1). Since we want a character with char code 10 to separate the lines, we need to set it explicitly below.

    \endlinechar=10~

Set endlinechar to 10, so each new line character in the file will be represented by a character 10 (to be precise, a token with char code 10 and catcode 12) in the resulting string.

Since this is an expl3 environment, where ordinary spaces are ignored, it's good practice to terminate the number explicitly with ~ (or use \scan_stop: / \relax, but I don't like those names, they're longer. There's also \int_set:Nn \endlinechar {10}, but that... relies on implementation details...?)
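The three ways of ending the number can be compared side by side. The following is only a sketch: \demo_endlinechar: is a hypothetical name, and the body sits in a macro so that it is fully tokenized before \endlinechar changes. Without an explicit terminator, TeX would keep expanding the following tokens while looking for more digits.

```latex
\documentclass{article}
\begin{document}
\ExplSyntaxOn
\cs_new_protected:Npn \demo_endlinechar: % hypothetical name
  {
    \group_begin:
      % 1. a space token (written ~ in expl3) ends the number
      \endlinechar = 10 ~
      \iow_term:x { \int_use:N \endlinechar }
      % 2. \scan_stop: (the expl3 name for \relax) ends the number
      \endlinechar = 13 \scan_stop:
      \iow_term:x { \int_use:N \endlinechar }
      % 3. \int_set:Nn delimits the value with braces
      \int_set:Nn \endlinechar { 10 }
      \iow_term:x { \int_use:N \endlinechar }
    \group_end: % the previous \endlinechar is restored here
  }
\demo_endlinechar:
\ExplSyntaxOff
\end{document}
```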

} \result
\str_set:NV \result \result

Detokenize the result. This step is important because, for the result to be a proper string, tokens with char code 32 (space) should have catcode 10 (space), whereas after the read above they have catcode 12 (other).
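The catcode difference is invisible when printed, but a catcode-sensitive comparison reveals it. In this sketch (not from the original answer), a space with catcode 12 is built by hand with \char_generate:nn to mimic what the read under \c_other_cctab produces, and \tl_if_eq:NNTF is used because it compares category codes as well as character codes.

```latex
\documentclass{article}
\begin{document}
\ExplSyntaxOn
% "a", a space with catcode 12 (other), "b"
\tl_set:Nx \l_tmpa_tl
  { \tl_to_str:n { a } \char_generate:nn { 32 } { 12 } \tl_to_str:n { b } }
% Reference string: its space has catcode 10
\str_set:Nn \l_tmpb_str { a~b }

% Before detokenizing, the spaces differ in catcode: "different" is expected
\tl_if_eq:NNTF \l_tmpa_tl \l_tmpb_str
  { \iow_term:n { before:~equal } } { \iow_term:n { before:~different } }

% Same step as in the answer: detokenize via \str_set:NV
\str_set:NV \l_tmpa_str \l_tmpa_tl

% After detokenizing, the space has catcode 10: "equal" is expected
\tl_if_eq:NNTF \l_tmpa_str \l_tmpb_str
  { \iow_term:n { after:~equal } } { \iow_term:n { after:~different } }
\ExplSyntaxOff
\end{document}
```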

Limitations

  • First (XeTeX/LuaTeX engines only), if there happens to be a character with char code ≥ 256 and an unusual catcode (e.g. catcode 1, open brace), it might break. Most of the time, however, the detokenization can handle it.
  • If the file to be read is a TeX file, SyncTeX data for that file might be "lost" (the details are complex).
  • Trailing spaces on each line are stripped (and maybe trailing tabs as well).
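As a sketch of one of the "other options" mentioned at the top (this is not the answer's method), the file can also be read line by line with \ior_str_map_inline:Nn, which reads every character with catcode 12 except spaces, regardless of the catcodes currently in force. That should sidestep the first limitation, while trailing spaces are presumably still stripped, since that happens when TeX reads each line. The names \g_demo_ior and \l_demo_result_str are hypothetical.

```latex
\documentclass{article}
\begin{document}

\begin{filecontents}{test test.txt}
line 1 !?#\xyz
line 2
\end{filecontents}

\ExplSyntaxOn
\ior_new:N \g_demo_ior        % hypothetical stream name
\str_new:N \l_demo_result_str % hypothetical variable name

\ior_open:Nn \g_demo_ior { test~test.txt }
\str_clear:N \l_demo_result_str
\ior_str_map_inline:Nn \g_demo_ior
  {
    % #1 is one line of the file, without its line end
    \str_put_right:Nn \l_demo_result_str {#1}
    % re-add a character 10 (catcode 12), matching the \file_get:nnN result
    \str_put_right:Nx \l_demo_result_str { \char_generate:nn { 10 } { 12 } }
  }
\ior_close:N \g_demo_ior

\str_show:N \l_demo_result_str
\ExplSyntaxOff

\end{document}
```

Because the loop touches the string once per line, this is likely slower than \file_get:nnN on large files.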

Related questions

user202729
  • Side note, the \str_show might print something that doesn't make sense, depending on your operating system, because character 10 is a CR or LF, something like that. – user202729 Jun 10 '22 at 13:32
  • 1) In the paragraph starting with “Detokenize the result”, there is a mix-up between char code and catcode (swap needed). 2) \str_show:n should be \str_show:N. 3) I get “Undefined control sequence” for \pretty:V. I can fix these if you want... – frougon Jun 26 '22 at 21:04
  • @frougon Ah, that \pretty thing is an internal debugging macro of mine; I keep forgetting to delete them. Otherwise good points. – user202729 Jun 27 '22 at 00:10
  • Would be nice to make a package out of this code – yegor256 Oct 07 '22 at 05:34