Read file contents to variable and iterate over each character in file (hexdump)

Question

I need to read the contents of a binary file into a variable as a string, iterate over each character of that string, convert each character to its hexadecimal representation with a comma (or any other distinguishable character) as separator, and embed the result in the LaTeX file that does all of this.

In Python I have something like this (4 lines!):

fileContent = ""
with open("binary.so", "rb") as f:
    fileContent = f.read()
print(",".join(str(b) for b in bytearray(fileContent)))  # bytearray yields ints on Python 2 and 3

Now the hard part:

no extra packages = only tex primitives
defined as tex macro: \readfile{ input_file_here }
must work with pdflatex
file might be empty
newlines and other special characters must be preserved (it is binary!)
I need a check if it is really a file and not a directory and if I'm allowed to read such files
should work on Linux/Unix and Windows (actually I need only for Linux, but more general solution would be nice)
no extra definitions in preamble
no luatex, no immediate, so no shell and no extra tools

Because this is a binary file (I need to treat all files like this) we have character values between 0 (0x00) and 255 (0xFF). Numbers might be HEX or just decimal. base64 encoding would be OK too, but without external tools and extra flags like --shell-escape.

\readfile{binary.so} % will return something like "D,E,A,D,B,E,E,F,1,3,3,7" or "0x0D,0x0E,..." or "13,14,..."

I tried something like this and it didn't work: (example taken from here)

\newread\makerfile
\openin\makerfile=binary.so
\ifeof\makerfile
\else
    % it looks promising, but it shouldn't read by newlines
    \read\makerfile to\makerline
    \closein\makerfile
    % iterate over each character
    % ...
\fi

E.g. this works, but I cannot iterate over characters, because I don't know how

\makeatletter
\newcommand\saferead[1] {
  \bgroup
  \let\do\@makeother
  \dospecials
  \catcode`\ =10 % spaces
  \catcode`\^^M=\active % newlines
  \input{#1}
  \egroup 
}
\makeatother

I googled like crazy and found only some useful bits but I'm still far away from what I want.

Welcome to TeX.SX! You say "only native TeX code": what do you mean by that? Only TeX primitives? Only plain TeX? Only LaTeX? Why no packages? From what I have already seen, the no-packages restriction usually causes more trouble than it solves. Also, why no \immediate? Do you mean the commands to be executed only when TeX ships a page out? — Phelype Oleinik, Aug 26 '19 at 22:19
Do you have a good reason for that restriction? Also, what do you mean by "no extra definitions"? The command you ask for will itself be a definition, will it not? — Phelype Oleinik, Aug 26 '19 at 22:21
I need to show in my thesis (about penetration testing in web security. we have PHP and pdflatex installed) a PoC for LaTeX-injection. I was able to read global variables like $PATH, $$ etc. (invoked without --shell-escape flag for security reasons). I'm able to read arbitrary ascii files, and also write arbitrary tex in temp. There are multiple injection points in the middle of the document. — Awaaaaarghhh, Aug 26 '19 at 22:24
That's why I have such hard restrictions. So no extra packages. No verbatim package, etc. Well, no definitions in preamble so where we include our packages before \begin{document}. So only between \begin{document} and \end{document}. — Awaaaaarghhh, Aug 26 '19 at 22:27
Conversely, there is another case: building binary files from scratch with LaTeX and writing them to the filesystem (the reverse of the operation described here). Such a file would then be included as phar://binary-generated-with-latex to abuse use-after-free/buffer-overflow vulnerabilities in PHP. LaTeX is of course used server-side to generate PDFs. — Awaaaaarghhh, Aug 26 '19 at 22:29
I mean I know that are hard restrictions. On the other hand lots of packages are written somehow in the same way, I read the source code of multiple file operation packages, but I'm still a bit clueless. Despite the vulnerabilities I'm a bit curious how to accomplish that task. So even not a full solution is required, but it would be OK if you could show some code snippets how to achieve that. — Awaaaaarghhh, Aug 26 '19 at 22:31
As the answers already posted point out, using the 'classical' TeX primitives, there are some things that are never going to be possible with a binary. TeX normalises line ends and such before the macro layer ever sees anything. Is a 'TeX90' solution what you are after or is using a pdfTeX primitive allowed? — Joseph Wright, Aug 27 '19 at 06:39
This github page was my initial motivation (something like this was used on some CTF): https://github.com/swisskyrepo/PayloadsAllTheThings/tree/master/LaTeX%20Injection but later I discovered this paper: https://hovav.net/ucsd/dist/texhack.pdf Then I was just curious if it would be possible to do a hexdump with tex primitives only. — Awaaaaarghhh, Aug 27 '19 at 23:00
@Awaaaaarghhh "primitives only" is tricky. Anything in TeX is either a primitive or a macro. If it's a macro it expands to other things, and these things are either primitives or more macros. Eventually all macros expand and you are left with primitives only. You could say that everything can be done with primitives. Of course I understand your request, and siracusa's answer is what comes closest to it (the only non-primitive it uses is \loop...\repeat). I made a faster (compared to my previous answer) version which sees all characters. But it's as far from "primitives only" as it can get :-) — Phelype Oleinik, Aug 29 '19 at 13:47
@PhelypeOleinik the reason I have so many restrictions is that I need to show that I can read any file be it binary or just a regular text file. I have my LaTeX template where I can insert some strings which come from database. It happens inside \begin{document} ... \end{document}. My job is to show that server side LaTeX injection is a real threat. I was able to read some simple ASCII files however I was not able to read arbitrary binary files because I would have to use some extra packages (in preamble). That's why the restriction not to use any packages and only tex primitives. — Awaaaaarghhh, Aug 30 '19 at 17:45

siracusa · Accepted Answer · 2019-08-27T01:52:47.837

As Heiko Oberdiek points out in this answer, pdfTeX defines a new, expandable primitive \pdffiledump which can be used to read files in binary mode. The syntax of the command is

\pdffiledump offset 0 length <length>{<filename>}

where for <length> we can use another primitive, \pdffilesize{<filename>}. The result is a sequence of pairs XX, where each XX is the two-digit hex representation of one byte of the input file. The rest of the processing is similar to the old answer below, except that we don't need the extra hex conversion.

\documentclass{article}

\makeatletter

\def\showbinary#1{%
    \begingroup
    \xdef\@temp{\pdffiledump offset 0 length \pdffilesize{#1}{#1}}%
    \expandafter\analyze\expandafter{\@temp}%
    \endgroup
}

\def\analyze#1{%
    \count@=0
    \if\relax\detokenize{#1}\relax\else
        \expandafter\analyze@#1\@end
    \fi
}
\def\analyze@#1#2#3\@end{%
    #1#2
    \advance\count@ by 1
    \ifnum\count@>15
        \count@=0
        \par
    \fi
%
    \let\@next=\relax
    \if\relax\detokenize{#3}\relax\else
        \def\@next{\analyze@#3\@end}%
    \fi
    \@next
}

\makeatother

\begin{document}
\ttfamily
\showbinary{ascii.txt}
\end{document}

outputs the bytes of the file as hex pairs, 16 per line.
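The loop above passes the whole remaining dump as a delimited argument on every step, which may explain the long run times reported for large files. The following is a minimal sketch of an alternative, not a definitive implementation. It assumes pdfTeX and relies on \pdffilesize expanding to nothing when the file cannot be found. The names \hexpairs, \hexpairs@ and \hexpairs@size are made up for illustration, and everything sits inside the document body. It reads two hex digits per step and stops at a pair of \relax sentinels, so the rest of the dump is never re-scanned. An empty file simply prints nothing.

```latex
\documentclass{article}
\begin{document}
\makeatletter
% Hypothetical helper names: \hexpairs, \hexpairs@, \hexpairs@size.
\def\hexpairs#1{%
  % Assumption: \pdffilesize expands to nothing if the file cannot be found.
  \edef\hexpairs@size{\pdffilesize{#1}}%
  \ifx\hexpairs@size\empty
    [file not found]%
  \else
    % Expand \pdffiledump once, then walk the digits two at a time;
    % the two \relax tokens mark the end of the dump.
    \expandafter\hexpairs@
      \pdffiledump offset 0 length \hexpairs@size{#1}\relax\relax
  \fi
}
\def\hexpairs@#1#2{%
  \ifx\relax#1\else
    #1#2 % one byte as two hex digits, followed by a space
    \expandafter\hexpairs@
  \fi
}
\makeatother
\raggedright\ttfamily
\hexpairs{\jobname.tex}
\end{document}
```

This version does not break lines after 16 bytes; it leaves line breaking to \raggedright. Replacing the space after #1#2 with a comma would give comma-separated output, at the cost of a trailing comma after the last byte.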

Old answer

Not a complete answer, but this is the best I could come up with for reading a binary file:

\documentclass{article}

\makeatletter

\def\showbinary#1{%
    \begingroup
    \count@=0
    \loop
        \catcode\count@=12
        \advance\count@ by 1
    \ifnum\count@<256
    \repeat
%
    \endlinechar=-1
    \everyeof{\noexpand}%
    \xdef\@temp{\@@input #1 }%
%
    \analyze\@temp
    \endgroup
}

\def\analyze#1{%
    \expandafter\analyze@#1\@end
}
\def\analyze@#1#2\@end{%
    \count@=`#1\relax
    \expandafter\hex\expandafter{\the\count@}
    \let\@next=\relax
    \if\relax\detokenize{#2}\relax\else
        \def\@next{\analyze@#2\@end}%
    \fi
    \@next
}

\def\hex#1{%
    \begingroup
    \count@=#1\relax
    \divide\count@ by 16
    \hexchar\count@
%
    \multiply\count@ by 16
    \advance\count@ by -#1\relax
    \multiply\count@ by -1
    \hexchar\count@
    \ifnum\count@=15\par\fi
    \endgroup
}
\def\hexchar#1{%
    \ifcase#10\or1\or2\or3\or4\or5\or6\or7\or8\or9\or A\or B\or C\or D\or E\or F\else x\fi
}

\makeatother

\begin{document}
\ttfamily
\showbinary{ascii.txt}
\end{document}

outputs one hex pair for each character read from the file.

ascii.txt is a binary file including all characters from 0x00 to 0xFF. First all those characters are set to catcode 12 (other), then the file is \input'ed and its contents stored in a macro \@temp. Afterwards we iterate over each character in \@temp to output its hex representation.

As you can see, three characters are missing: 0x09 (\t), 0x0A (\n) and 0x0D (\r). The latter two are likely because TeX files are read in text mode and not in binary mode. Not sure if something can be done about that. The tab character is missing in this particular test file because TeX treats tabs like spaces when they occur at the end of a line (the \t is immediately followed by \n) and thus removes it from the input line.
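To make the arithmetic in the \hex macro concrete, here is a worked example for the byte value 175 (chosen purely for illustration). Integer division by 16 gives the high hex digit, and subtracting 16 times that quotient from the original value gives the low hex digit:

```latex
\[
  \lfloor 175 / 16 \rfloor = 10 \mapsto \mathtt{A}, \qquad
  175 - 16 \cdot 10 = 15 \mapsto \mathtt{F}
  \quad\Longrightarrow\quad 175 = \mathtt{0xAF}.
\]
```

The macro computes the remainder as $-(16 \cdot 10 - 175)$, which is the same value, because TeX's \divide truncates and there is no remainder primitive.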

I accepted this as the answer because it doesn't require any extra packages and seems to work. I took a random binary file of ~355 KB (363,952 bytes) and used it as input for your first solution (\pdffiledump). It seems to be slow: I started the job 1.5 h ago and it is still not finished :D But it works, because I can see the PDF file and test.synctex(busy) growing in size — Awaaaaarghhh, Aug 27 '19 at 22:34
second solution seems to be a bit faster in reading binaries — Awaaaaarghhh, Aug 27 '19 at 22:44
Really advanced code (yours, egreg and @Phelype Oleinik). Thank you! Is there a way to detect newlines and tabs and somehow insert them on the flow? — Awaaaaarghhh, Aug 27 '19 at 22:55
@Awaaaaarghhh The tab removal in the old answer seems to depend on the used compiler, which seems like a bug to me. However, the new version doesn't have this problem, it detects all 8-bit characters correctly. — siracusa, Aug 27 '19 at 23:34

Phelype Oleinik · Answer 2 · 2019-08-29T13:42:26.013

You're using LaTeX, not Plain, so it doesn't make much sense to not use packages. With a bit of expl3 code you can make yourself a proper hexdump of a file.

My previous answer (see edit history) used a rather simple expl3 code to read in the file and make a hexdump out of it. However the code was rather slow (it took about 60 seconds to produce 7 pages of hexdump of a 6 kB file).

I made a slightly optimized version (takes about half a second to process the same file :-) with a few more niceties: it's faster, it has some key-val properties to control the output, it's faster, it uses \pdf@filedump from pdftexcmds to avoid losing line feeds and spaces, and it's much faster :-)

Here it is:

\documentclass{article}
\usepackage{pdftexcmds}
\usepackage{xparse}
\ExplSyntaxOn
\cs_new_eq:Nc \__hexdump_filedump:nnn { pdf@filedump }
\cs_new_eq:Nc \__hexdump_filesize:n { pdf@filesize }
\int_new:N \l__hexdump_begin_int
\int_new:N \l__hexdump_bytes_int
\int_new:N \l__hexdump_filesize_int
\int_new:N \l__hexdump_byte_int
\int_new:N \l__hexdump_byte_ptr_int
\int_new:N \l__hexdump_word_int
\int_new:N \l__hexdump_word_ptr_int
\int_new:N \l__hexdump_column_int
\int_new:N \l__hexdump_column_ptr_int
\int_new:N \l__hexdump_line_length_int
\int_new:N \l__hexdump_address_size_int
\int_new:N \l__hexdump_address_int
\bool_new:N \l__hexdump_address_bool
\tl_new:N \l__hexdump_dump_tl
\tl_new:N \l__hexdump_font_tl
\tl_new:N \l__hexdump_visible_tl
\clist_new:N \l__hexdump_cols_clist
\seq_new:N \l__hexdump_cols_seq
\cs_generate_variant:Nn \str_count:n { f }
\keys_define:nn { hexdump }
  {
    , begin   .int_set:N   = \l__hexdump_begin_int
    , begin   .initial:n   = { 0 }
    , length  .int_set:N   = \l__hexdump_bytes_int
    , length  .initial:n   = { -1 }
    , byte    .int_set:N   = \l__hexdump_byte_int
    , byte    .initial:n   = { 2 }
    , columns .clist_set:N = \l__hexdump_cols_clist
    , columns .initial:n   = { 4, 4 }
    , font    .tl_set:N    = \l__hexdump_font_tl
    , font    .initial:n   = \ttfamily
  }
\NewDocumentCommand \hexdump { o m }
  {
    \group_begin:
      \IfValueT {#1} { \keys_set:nn { hexdump } {#1} }
      \hexdump:n {#2}
    \group_end:
  }
\cs_new_protected:Npn \hexdump:n #1
  {
    \file_if_exist:nTF {#1}
      { \__hexdump_read:n {#1} }
      { \msg_error:nnn { hexdump } { file-not-found } {#1} }
  }
\cs_new_protected:Npn \__hexdump_read:n #1
  {
    \int_set:Nn \l__hexdump_filesize_int { \__hexdump_filesize:n {#1} }
    \__hexdump_assert_int:Nnn \l__hexdump_begin_int
      { \c_zero_int } { \l__hexdump_filesize_int }
    \int_compare:nNnTF { \l__hexdump_bytes_int } = { -1 }
      { \int_set:Nn \l__hexdump_bytes_int { \l__hexdump_filesize_int } }
      {
        \__hexdump_assert_int:Nnn \l__hexdump_bytes_int
          { \c_zero_int } { \l__hexdump_filesize_int }
      }
    \tl_set:Nx \l__hexdump_dump_tl
      {
        \__hexdump_filedump:nnn
          { \l__hexdump_begin_int } { \l__hexdump_bytes_int }
          {#1}
      }
    \tl_map_function:nN { \. \? \! \: \; \, } \__hexdump_french_spacing:N
    \tl_use:N \l__hexdump_font_tl
    \__hexdump:N \l__hexdump_dump_tl
  }
\cs_new_protected:Npn \__hexdump_french_spacing:N #1
  { \char_set_sfcode:nn { `#1 } { 1000 } }
\cs_new_protected:Npn \__hexdump_assert_int:Nnn #1 #2 #3
  { \int_set:Nn #1 { \int_min:nn { \int_max:nn { #1 } { #2 } } { #3 } } }
\msg_new:nnn { hexdump } { file-not-found }
  { File~`#1'~not~found. }
\cs_new_protected:Npn \__hexdump:N #1
  {
    \__hexdump_initialise:
    \exp_last_unbraced:NV \__hexdump:NNw #1
      \q_recursion_tail \q_recursion_tail \q_recursion_stop
  }
\cs_new_protected:Npn \__hexdump_initialise:
  {
    \seq_set_from_clist:NN \l__hexdump_cols_seq \l__hexdump_cols_clist
    \int_set:Nn \l__hexdump_word_int { \seq_item:Nn \l__hexdump_cols_seq { 1 } }
    \int_set:Nn \l__hexdump_column_int { \seq_count:N \l__hexdump_cols_seq }
    \int_set:Nn \l__hexdump_address_size_int
      { \str_count:f { \int_to_hex:n { \l__hexdump_bytes_int } } }
    \int_set_eq:NN \l__hexdump_address_int \l__hexdump_begin_int
    \int_set:Nn \l__hexdump_line_length_int
      { \l__hexdump_byte_int * ( \seq_use:Nn \l__hexdump_cols_seq { + } ) }
    \exp_args:NNf \seq_put_right:Nn \l__hexdump_cols_seq
      { \seq_item:Nn \l__hexdump_cols_seq { 1 } }
    \bool_set_true:N \l__hexdump_address_bool
    \int_zero:N \l__hexdump_byte_ptr_int
    \int_zero:N \l__hexdump_word_ptr_int
    \int_zero:N \l__hexdump_column_ptr_int
  }
\cs_new_protected:Npn \__hexdump:NNw #1 #2
  {
    \quark_if_recursion_tail_stop_do:Nn #1
      { \__hexdump_end: }
    \bool_if:NT \l__hexdump_address_bool { \__hexdump_address: }
    #1 #2
    \tl_put_right:Nx \l__hexdump_visible_tl
      {
        \__hexdump_if_visible_ascii:nTF { "#1#2 }
          { \char_generate:nn { "#1#2 } { 12 } }
          { . }
      }
    \__hexdump_ptr_check:
    \__hexdump:NNw
  }
\cs_new_protected:Npn \__hexdump_ptr_check:
  {
    \__hexdump_ptr_step:nn { byte }
      {
        \c_space_tl
        \__hexdump_ptr_step:nn { word }
          {
            \int_set:Nn \l__hexdump_word_int
              {
                \seq_item:Nn \l__hexdump_cols_seq
                  { \l__hexdump_column_ptr_int + 2 }
              }
            \c_space_tl
            \__hexdump_ptr_step:nn { column }
              { \tex_unskip:D \__hexdump_dump_visible: }
          }
      }
  }
\cs_new_protected:Npn \__hexdump_ptr_step:nn #1 #2
  {
    \int_incr:c { l__hexdump_#1_ptr_int }
    \int_compare:nNnT
        { \int_use:c { l__hexdump_#1_ptr_int } }
          =
        { \int_use:c { l__hexdump_#1_int } }
      {
        \int_zero:c { l__hexdump_#1_ptr_int }
        #2
      }
  }
\prg_new_protected_conditional:Npnn \__hexdump_if_visible_ascii:n #1 { TF }
  {
    \int_compare:nNnTF {#1} > {31}
      {
        \int_compare:nNnTF {#1} < {127}
          { \prg_return_true: }
          { \prg_return_false: }
      }
      { \prg_return_false: }
  }
\cs_new_protected:Npn \__hexdump_address:
  {
    \bool_set_false:N \l__hexdump_address_bool
    \exp_args:Nf \__hexdump_address:nn
      { \str_count:f { \int_to_hex:n { \l__hexdump_address_int } } }
      { \l__hexdump_address_size_int }
    \int_add:Nn \l__hexdump_address_int { \l__hexdump_line_length_int }
  }
\cs_new_protected:Npn \__hexdump_address:nn #1 #2
  {
    \prg_replicate:nn { #2 - #1 } { 0 }
    \int_to_hex:n { \l__hexdump_address_int } : ~
  }
\cs_new_protected:Npn \__hexdump_dump_visible:
  {
    | \tl_use:N \l__hexdump_visible_tl |
    \tl_clear:N \l__hexdump_visible_tl
    \bool_set_true:N \l__hexdump_address_bool
    \tex_par:D
  }
\cs_new_protected:Npn \__hexdump_end:
  {
    \bool_if:NF \l__hexdump_address_bool
      {
        \c_space_tl \c_space_tl
        \tl_put_right:Nn \l__hexdump_visible_tl { ~ }
        \__hexdump_ptr_check:
        \__hexdump_end:
      }
  }
\ExplSyntaxOff
\begin{document}
\hexdump{somebinary.file}
\end{document}

Visible bytes (ASCII 32 – 126) are printed, and everything else is represented by a . in the right pane:

I guess I need to add an interface for file dumps ... — Joseph Wright, Sep 12 '19 at 17:05

score 3 · Answer 3 · answered Aug 27 '19 at 20:42

Here's an expandable version (using two “forbidden functions”), using ideas of siracusa.

\documentclass{article}
\usepackage{xparse}

\ExplSyntaxOn

\NewExpandableDocumentCommand{\hexdump}{O{~}m}
 {
  \awa_hexdump:ne {#1} { \tex_filedump:D~offset~0~length~\tex_filesize:D{#2}{#2} }
 }
% there's not yet an official interface to \pdffiledump and \filesize

\cs_new:Nn \awa_hexdump:nn
 {
  \__awa_hexdump_read_byte:nNNN {#1} #2 \q_nil \q_stop
 }
\cs_generate_variant:Nn \awa_hexdump:nn { ne }

\cs_new:Nn \__awa_hexdump_read_byte:nNNN
 {
  \quark_if_nil:nTF { #4 }
   % true: print the last two digits and ignore the trailer
   { #2#3 \use_none:n }
   % false: print two digits and the separator, then continue with the lookahead token #4
   { #2#3#1 \__awa_hexdump_read_byte:nNNN { #1 } #4 }
 }

\ExplSyntaxOff

\begin{document}

\raggedright\ttfamily
\hexdump{cmr10.tfm}

\hexdump[,\hspace{0pt plus 1fill}]{\jobname.tex}

\end{document}

I used a copy of the standard cmr10.tfm file. The optional argument (default a space) is for controlling the delimiter between two bytes.

The picture shows the last two lines from the first call and the first two lines from the second call.

A check for the existence of the file can be easily added.
