[This is part 1 of my answer.
Due to the limitation of the amount of characters within an answer I have to divide this answer into two parts.
This part contains a lot of explanations about how things work in LaTeX.
Part 2 contains a coding-example for a routine \UDCollectverbarg. ]
Seems you wish—on the basis of a set of tokens—to create .dvi- or .pdf-output or an external text file whose content looks like the tex source code which led to the coming into being of these tokens.
Due to the way in which LaTeX "digests" .tex-input/tex source code, it is not possible to exactly conclude from a set of tokens to the look of the tex source code due to which these tokens came into being.
This has to do with the ways in which LaTeX acts when reading/processing the tex source code for forming tokens:
LaTeX does read the tex source code line by line, processing each line character by character for forming tokens (character tokens, control sequence tokens) that are to be inserted into the token stream for further processing.
(There are two kinds of control sequence tokens:
Control word tokens are control sequence tokens whose names consist of a single character of category code 11(letter) or of several characters. E.g., \e and \LaTeX.
Control symbol tokens have names that consist of a single character which is not of category code 11(letter). E.g., \!, \?, \7. )
One of the first things LaTeX does to a line of input, even before starting producing tokens, is removing all space characters (the number of the code point of the space character is 32 both in UTF-8 and in ASCII which are the two possible internal character encodings of LaTeX) that are at the right end of it. After that it inserts at the right end of the line a character whose code point's number equals the value of the integer parameter \endlinechar. Usually the value of \endlinechar is 13 which is the number of the code point for the carriage return character in many encodings, e.g., in ASCII and in UTF-8. ASCII is the internal character encoding with old-school TeX engines. UTF-8 is the internal character encoding with more recent TeX engines like LuaTeX and XeTeX.
Then LaTeX switches the state of its reading apparatus to state N (new line).
(LaTeX has a reading apparatus. It can have one of three states:
State N: New line. This state indicates that LaTeX is starting to process another line of input.
State M: Middle of line. This state indicates that LaTeX is processing characters somewhere within a line of input.
State S: Skipping blanks. This state indicates that LaTeX shall skip characters whose category code is 10(space) rather than inserting an explicit space token (character code 32, category code 10(space)) into the token stream.)
Then LaTeX starts to look at the line, character by character.
In LaTeX each character has a so called category code.
The category code of a character influences what action LaTeX will perform when in the input encountering that character.
E.g., when LaTeX in the input finds a character of category code 0(escape), LaTeX will start gathering the name of a control sequence token from the subsequent characters on the current line and then insert that control sequence token into the token stream. Usually the backslash-character \ is the only character whose category code is 0(escape).
Right after finding such a character, LaTeX will be right at the start of gathering the name of a control sequence token.
In case the following character does not have category code 11(letter), LaTeX will take that next character for the name of a control symbol token and stop gathering and insert the corresponding control symbol token into the token stream and switch the reading apparatus to state M.
In case the following character does have category code 11(letter), LaTeX will take that next character for the first character of the name of a control word token and will keep on gathering (hereby being somewhere in the middle of gathering) until either reaching the end of the line or reaching the end of the file or encountering a character whose category code is not 11(letter) which then will not be considered part of the name of the control word token in question but will be considered something that needs to be looked at separately.
Then LaTeX will take the characters gathered so far for the name of the control word token and insert the corresponding control word token into the token stream and switch the reading apparatus to state S.
E.g., when LaTeX in the input finds a character of category code 11(letter) while not gathering the name of a control word token, LaTeX will insert a character token into the token stream whose category is 11(letter) and whose character code equals the number of the code point of that character in LaTeX's internal character encoding (which, depending on the underlying engine, either is ASCII or is UTF-8).
When LaTeX in the input finds a character of category code 11(letter) while gathering the name of a control word token, it will take that character for a part of the name of that control word token and keep on gathering.
E.g., when LaTeX in the input finds a character of category code 12(other) while not at the start of gathering the name of a control sequence token, LaTeX will insert a character token into the token stream whose category is 12(other) and whose character code equals the number of the code point of that character in LaTeX's internal character encoding (which, depending on the underlying engine, either is ASCII or is UTF-8) and will switch the reading apparatus to state M.
E.g., when LaTeX in the input finds a character of category code 12(other) while at the start of gathering the name of a control sequence token, LaTeX will take that character for the name of a control symbol token and will insert the corresponding control symbol token into the token stream and switch the reading apparatus to state M.
By the way:
LaTeX always switches the reading apparatus to state M after tokenizing and inserting into the token stream a non-space character token or a control symbol token whose name is not formed by a character of category code 10(space).
LaTeX always switches the reading apparatus to state S after tokenizing and inserting into the token stream a control-symbol-token whose name is formed by a character of category code 10(space) or an explicit space token or a control word token.
LaTeX always switches the reading apparatus to state N when starting to process another line of TeX-input.
E.g., when LaTeX in the input finds a character of category code 10(space)—usually the space character, code point 32 both in ASCII and in UTF-8, and the horizontal-tab character, code point 9 both in ASCII and in UTF-8, are the only characters of category code 10(space), there basically are two possibilities:
Possibility 1:
LaTeX might be at the start of gathering the name of a control sequence token: In this case LaTeX will take the space character for the name of that control sequence token and thus will insert the control symbol token \ (control space) into the token stream and switch the reading apparatus to state S.
Possibility 2:
LaTeX might not be at the start of gathering the name of a control sequence token.
In case it is somewhere in the middle of gathering the name of a control word token, the characters gathered so far will form the name of that control word token, and that control word token will be inserted into the token stream and the reading apparatus will be switched to state S.
Further action depends on the state of the reading apparatus:
In state S, LaTeX will ignore the space character, not inserting any token for it into the token stream. (Now you see why spaces in the input behind character sequences that lead to insertion of control word tokens into the token stream won't lead to the insertion of space tokens into the token stream.)
In state N, LaTeX will ignore that space character, not inserting any token for it into the token stream.
In state M, LaTeX will insert an explicit space token (character token of category 10(space) and character code 32) into the token stream.
In any case LaTeX will switch to state S after encountering a character of category code 10(space).
This implies that with several consecutive characters of category code 10(space), the ones that follow the first one will always be skipped, not yielding any token as they will always find the reading apparatus switched to state S due to the processing of their predecessors.
Another effect of this concept is that sequences of subsequent characters of category code 10(space) that in the tex source code occur at the beginnings of lines won't lead to the coming into being of whatsoever tokens: The first one will not yield whatsoever token due to the reading apparatus being in state N. The subsequent ones will not yield whatsoever token due to the reading apparatus being in state S.
This is why you usually can indent macro code and the like from the left by means of space characters and/or tab characters for improving the readability.
E.g., when LaTeX in the input finds a character of category code 14(comment) while right at the start of gathering the name of a control sequence token, LaTeX will insert the control symbol token whose name corresponds to that character into the token stream and switch the reading apparatus to state M. (Exception: In case that character was a space character, you'll get a control space \ and the reading apparatus will be switched to state S.)
When LaTeX in the input finds a character of category code 14(comment) while not right at the start of gathering the name of a control sequence token, LaTeX will stop processing the current line, and thus drop subsequent characters of that line. In case of being in the middle of gathering the name of a control word token, the characters gathered so far will form the name of that control word token and the corresponding control word token will be inserted into the token stream.
As LaTeX stops processing the current line, it will continue with starting processing the next line of input if present. When grabbing that next line for processing it, the state of the reading apparatus will be switched to state N. Usually % is the only character of category code 14(comment).
E.g., when LaTeX in the input finds a character of category code 5(end of line) while right at the start of gathering the name of a control sequence token, LaTeX will insert the control symbol token whose name corresponds to that character into the token stream and switch the reading apparatus to state M. (Exception: In case that character was a space character, you'll get a control space \ and the reading apparatus will be switched to state S.)
When LaTeX in the input finds a character of category code 5(end of line) while not right at the start of gathering the name of a control sequence token, LaTeX will stop processing the current line, and thus drop subsequent characters of that line. In case of being in the middle of gathering the name of a control word token, the characters gathered so far will form the name of that control word token and the corresponding control word token will be inserted into the token stream and the reading apparatus will be switched to state S.
The next action in this case depends on the state of the reading apparatus:
If in state N, the token \par will be inserted.
If in state M, an explicit space token (character token of category 10(space) and character code 32) will be inserted.
If in state S, no token will be inserted.
In any of these three cases LaTeX will then start processing the next line, hereby switching the reading apparatus to state N.
Above it was said, that LaTeX inserts a character according to the value of \endlinechar at each line-ending and that usually \endlinechar has the value 13 which denotes the code point of the carriage return character. Now the information is added that usually the carriage return character has category code 5. This implies that usually every line-ending is processed with a carriage return character of category code 5(end of line) at the end. Thus an empty line implies insertion of a carriage return character of category code 5(end of line) at the beginning of that empty line which in turn implies processing that inserted character while the reading apparatus is still in state N which in turn implies the insertion of a \par-token. That's why usually an empty line has the same effect as \par.
After this little excursus we see that a sequence of tokens does not necessarily resemble all the characters of the tex source code whose reading and tokenizing lead to the coming into being of that token sequence:
- In many situations spaces and tabs do not yield tokens at all.
- Empty lines might yield
\par-tokens while \string\par does not yield linebreaks but \, p, a, r.
- Characters of category code 14(comment) do not yield tokens at all.
- Characters of category code 9(ignore) do not yield tokens at all.
Also, the output of the \string-primitive does not necessarily resemble the tex source code:
If \string is applied to an explicit character token, the result will be a character token of equal character code but—in case of the character token where \string is applied to not having character code 32 (32 is the number of the space character's codepoint)—of category 12(other) or—in case of the character token where \string is applied to having character code 32—of category 10(space).
If \string is applied to a control sequence token, the result will be a sequence of character tokens:
In case the integer parameter \escapechar has a positive value within the range of the code points of the internal character encoding of the engine in use, a character token will be delivered whose character code equals the value of \escapechar and whose category is 12(other) (exception: In case of \escapechar denoting the space character, the category is 10(space) ). Usually \escapechar has the value 92 which is the number of the coding point of the backslash character. Then a sequence of character tokens follows, each of them denoting a character of the name of the control sequence token, the character code being the number of the code point of that character in LaTeX's internal character encoding, the category being 12(other) (in case of the code point in question not denoting the space character) or 10(space) (in case of the code point denoting the space character).
If \string is applied to the nameless control sequence token, the result will be the sequence \csname\endcsname.
Therefore I suggest going the opposite direction:
Create a macro which switches to verbatim category code régime and then gathers its argument via having LaTeX read and tokenize input from the file containing your tex source code.
"Switching to verbatim category code régime" means changing category codes of input characters in a way which leads to each character of that snippet of tex source code that forms the argument—after reading and tokenizing—having a counterpart in terms of a character token.
Under verbatim category code régime, e.g., the backslash does not have category code 0(escape) but does have category code 12(other) and thus does not lead to gathering the name of a control sequence token but does yield a backslash character token of category code 12(other). This way no control sequences will come into being while gathering the argument. Only character tokens will come into being.
Under verbatim category code régime, e.g., the space character does not have category code 10(space) but does have category code 12(other). Thus it will be treated as an ordinary thing whereafter the reading apparatus is not switched to state S but is switched to state M resulting in consecutive spaces not "collapsing into a single space token".
When LaTeX reads and tokenizes input under verbatim category code régime, then each character of the snippet of tex source code that forms the input in question will after reading and tokenizing have a counterpart in terms of a character token.
You can "feed" an argument that got tokenized this way to the \scantokens-primitive.
The \scantokens-primitive comes along with the eTeX extensions.
The \scantokens-primitive lets LaTeX act as if it would unexpanded-write the tokens that form its ⟨balanced text⟩ to external file and then load that file via \input.
During the latter part of that action, the \input-part, things get (re)tokenized under the category code régime which is in effect while \scantokens is carried out. If that is the normal category code régime, the (re)tokenization also may yield control sequence tokens etc.
During the further part, the \write-part, all the nice subtle rules apply that always apply to TeX's writing of tokens to external text file/screen. (Things like hash doubling. Things like character tokens whose character code equals the value of the integer parameter \newlinechar causing LaTeX to continue writing subsequent things at the beginning of a new line in the external text file/on the screen.)
A macro which switches to verbatim category code régime and then gathers its argument—how should you use such a macro?
As the macro is intended to collect arguments that are to be tokenized under verbatim category code régime, it should be possible to have it collect both arguments where opening curly braces and closing curly arguments are balanced and arguments where these braces are not balanced.
In the further case the same syntax can be applied as with any ordinary mandatory argument—i.e., just nest the argument inside curly braces.
In the latter case the syntax of LaTeX's \verb-macro needs to be applied where you use a character for delimiting the argument which does not occur inside the argument.
Thus it would be a nice idea to have the macro detect whether the very first token of the text that is to be read and tokenized under verbatim category code régime is a curly opening brace or is not a curly opening brace. If it is, collect the remaining tokens of that argument while applying the syntax of the further case. If it is not, collect the remaining tokens of that argument while applying the syntax of the latter case.
Henceforth an argument that is to be tokenized under verbatim category code régime will be called
a ⟨verbatimized argument⟩, no matter whether it is to be gatherted applying the further or the latter syntax.
Another issue with handling a ⟨verbatimized argument⟩ is the treatment of the ends of lines:
Above the \endlinechar-thingie was mentioned:
Above it was said, that LaTeX inserts a character according to the value of \endlinechar at each line-ending and that usually \endlinechar has the value 13 which denotes the code point of the carriage return character. Then the information was added that usually the carriage return character has category code 5(end of line) which depending on the state of the reading apparatus may lead to the skipping of the "endline character" inserted or may lead to the coming into being of \par-tokens or explicit space tokens.
When you type text, i.e., when you create tex source code, the carriage return character does not occur in the middle of lines. It is a character which for the software which you use for creating/typing/viewing the tex source code, denotes the end of a line.
Under normal category code régime, where the ^-character has category code 7(math superscript), you can in tex source code apply ^^-notation for denoting some characters that cannot easily be typed on a keyboard.
Under normal category code régime you can in TeX source code use the character-sequence ^^M for denoting the return character. (M is the 13th letter in the alphabet; carriage return character has code point number 13...)
^^-notation will be transformed while reading and processing the line of input where it occurs, even before putting tokens into the token-stream and even while gathering names of control sequence tokens.
^^-notation is not available under verbatim category code régime, as under that régime the category code of ^ is switched to 12(other).
More exact explanations related to ^^-notation can be found in the TeXbook.
A macro for gathering a ⟨verbatimized argument⟩ could switch the category code of the carriage return character to 12(other). This way you get explicit carriage return character tokens for the ends of lines. (An explicit carriage return character token is a character token of category 12(other) and character code 13. 13 is the number of the code point of the carriage return character both in ASCII and in UTF-8, the possible internal character-encodings with (La)TeX-engines.)
This way you get explicit carriage return character tokens only for the ends of lines.
Thus in LaTeX within a ⟨verbatimized argument⟩ carriage return character tokens can only come into being due to the \endlinechar-thingie, and thus they can be used for denoting places where in the tex source code line-endings occurred.
Thus a nice feature of a macro that gathers a ⟨verbatimized argument⟩ would be an additional argument where you can provide tokens by which explicit carriage return character tokens (which can only come into being at line-endings, due to the \endlinechar-thingie) shall be replaced.
You could use this, e.g., for having explicit carriage return character tokens replaced by explicit line feed character tokens. (An explicit line feed character token is a character token of category 12(other) and character code 10. 10 is the number of the code point of the line feed character both in ASCII and in UTF-8, the possible internal character-encodings with (La)TeX-engines. J is the 10th letter in the alphabet. In tex source code you can under normal category code régime denote the line feed character via ^^J.)
Why would you do this? The reason has to do with LaTeX's integer parameter \newlinechar: When LaTeX does write tokens unexpanded to a text file or the screen, it will create linebreaks for explicit character tokens whose character codes equals the value of \newlinechar rather than writing the corresponding characters to file.
Usually the value of \newlinechar is 10. Thus usually, when writing tokens unexpanded to a text file or the screen, explicit line feed character tokens can be used for denoting places where LaTeX shall continue writing at the beginning of a new line.
When you intend to have passed the ⟨verbatimized argument⟩ to \scantokens, you can use this feature for having replaced all the explicit carriage return character tokens that denote line-endings by explicit line feed character tokens. The effect will be that when \scantokens does its unexpanded-writing-part, it will "write" line-breaks at these places.
[This link leads to part 2 of my answer.]
\scantokensfor re-tokenizing in situations where you do not need the stringified/verbatimized variant but the normal-catcode-régime-variant? – Ulrich Diez Jan 12 '19 at 11:10\verb-command? It does change the category codes before reading and tokenizing its argument so that the argument gets tokenized as a set of characters, no control-sequences. No loss of space-characters. It is possible to create a similar command. In situations where you don't want the characters but the control-sequence-tokens, you can pass the characters to\scantokenswhich acts as if things were written to external file before\inputting that file under normal catcode-régime where you get control-sequences etc... In case of interest I can elaborate on that. – Ulrich Diez Jan 12 '19 at 20:42