
When reading "TeX for the Impatient", we learn that:

  • The "eyes" of TeX convert the input file into a sequence of characters;
  • Then the "mouth" of TeX converts the character sequence into tokens;
  • Then the "esophagus" (gullet) expands the tokens into sequences of primitive commands;
  • Finally "the stomach" performs the operations indicated by the commands.

I have never seen another language use this token system: it is not used by Pascal, C, PHP, SQL, LISP, etc.

  • Why does TeX need tokens?
  • What are they for?
  • How do other computer languages manage without them?

This question is not a duplicate, because it ultimately asks why TeX uses tokens, what purpose they serve, and what problems justify their use. This token system has a purpose, a motivation, underlying concepts: which ones?

AndréC

2 Answers


There are some technical differences, but mostly it's just the "idiosyncratic" terminology used by D. Knuth: almost all systems are parsed in stages that separate lexical analysis (e.g. distinguishing a name from a numeric literal) from parsing itself (identifying the program structure). Tokens are the output of lexical analysis.

What is fairly unique to TeX is that you can change the lexical analysis during the run: in TeX, \foo@bar might be a single token (\foo@bar), or 5 tokens (\foo, @, b, a, r), or 8 tokens (\, f, o, o, @, b, a, r), depending on the category codes of \ and @ (or other token counts for more exotic catcode settings), whereas most languages use a fixed tokenisation.
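
A minimal sketch of this, runnable with plain TeX (the name \foo@bar here is just an illustrative example):

    \catcode`\@=11   % @ is now a "letter": \foo@bar is one control word
    \def\foo@bar{hello}
    \show\foo@bar    % the log reports one token: \foo@bar=macro:->hello
    \catcode`\@=12   % @ is now "other": the same characters would read
                     % as \foo, @, b, a, r (five tokens)
    \bye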

This dynamic aspect of tokenisation in TeX means that it plays a far more visible role. In C or Java it is just assumed, and usually left unsaid, that an expression such as 1+abc is three tokens (1, +, abc), and it does not depend on some run-time value whether abc is a single token representing a variable or two tokens a and bc juxtaposed.
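
A real-world instance of this dynamism is LaTeX's \makeatletter/\makeatother pair, which does nothing more than flip the category code of @, so that "internal" command names containing @ can be defined and used (\my@helper below is a made-up name for illustration):

    \documentclass{article}
    \makeatletter
    \newcommand\my@helper{internal}  % one control-word token while @ is a letter
    \makeatother
    % out here, \my@helper would be read as \my followed by @, h, e, l, p, e, r
    \begin{document}
    text
    \end{document}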

David Carlisle
  • Why does TeX need to modify the lexical analysis during its execution? What problems does this dynamic lexical analysis solve? – AndréC Jul 31 '17 at 21:59
  • it's just the way it is; it's like asking why Java needs to be object oriented, or why C needs pointers. There is no fixed syntax in TeX: \ is not special, nor is { or }. For example xmltex is a TeX format where \, { and } are just normal printing characters, but < and & are special syntax characters, as in XML (see the toy sketch after this thread). @AndréC – David Carlisle Jul 31 '17 at 22:02
  • TeX doesn't need to change the lexical analysis during execution; it only allows hackers to do it. Some package writers find this feature useful, so they may develop LaTeX, Texinfo or StarTeX as markup languages on top of TeX. Others find it amusing and then write stuff like xii.tex. :P – jarnosc Jul 31 '17 at 22:21
  • @erreka I had avoided mentioning xii.tex :-) – David Carlisle Jul 31 '17 at 22:24
  • @AndréC: It's a feature. There are other languages that allow one to change the lexer/parser. This is sometimes also called metaprogramming. An interesting example is Racket, which allows you to define and use new languages! – Aditya Aug 01 '17 at 03:18
  • @David Carlisle, the question is not why a language is object-oriented, but what problems object-oriented languages can solve. Similarly, I cannot believe that tokens are just a possibility offered to developers. This system has a purpose, a motivation, underlying concepts: which ones? – AndréC Aug 01 '17 at 07:00
  • @AndréC I think mainly it's, as I say above, the desire to have free syntax to add markup to different filetypes. SGML, another system of the same era, has a similarly declarative lexical analysis, although in that case more restricted, in that the tokenisation is specified per document but can not be changed mid-document. But I think the assumption that there has to be a final polished reason is false: like many languages it was designed in a relatively short space of time by relatively few people (one man + some students in this case) with very few users to test the concepts. It is what it is. – David Carlisle Aug 01 '17 at 10:13
  • @AndréC Couldn't one ask equally why other languages don't allow change in meaning on-the-fly? I can imagine sitting down and saying 'I don't want to put a restriction on the meaning of material: just because I think _ means subscript doesn't mean everyone does'. That might well lead to a fluid model for character meaning. – Joseph Wright Aug 02 '17 at 16:02
  • @AndréC --- There is a remark on page 3 of the TeXBook which suggests that not all computers had the same characters available on the keyboard (remember that TeX was designed a long time ago). Allowing the syntax to change prevents this from being a problem. – Ian Thompson Aug 02 '17 at 16:32
  • @Ian Thompson, the first version of TeX did not take non-US keyboards into account: TeX was first designed for American English. A bit was later added to the character encoding to adapt TeX to other languages that require accents. – AndréC Aug 02 '17 at 21:16
  • @AndréC --- That's not the point. Computing was in its infancy when TeX was written; there was no way to know what symbols would be available on a keyboard ten or twenty years later. Therefore Knuth allowed the syntax (e.g. using backslash as the escape character) to be changed. – Ian Thompson Aug 02 '17 at 21:24
  • @Joseph Wright, I wonder whether Donald Knuth introduced the possibility of modifying the tokenisation precisely to allow software layers to be built on top. Indeed, Knuth insists very much on the possibility of modifying everything. This is what allows, for example, TikZ to use the semicolon as its end-of-instruction marker instead of TeX's native space. – AndréC Aug 02 '17 at 21:29
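
Regarding the xmltex remark in the thread above, here is a toy sketch in plain TeX (not xmltex's actual code, and the \message text is made up) of how the syntax characters themselves can be reassigned; note the order, since once \ stops being the escape character it can no longer start control sequences:

    \catcode`\<=13              % 13 = "active": < now behaves like a macro
    \def<{\message{saw a tag opener}}
    <                           % runs the macro just defined
    {\catcode`\\=12  in this group a \ is just a printing character }
    \bye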

Let me ignore the eyes–mouth–stomach (and…) terminology mentioned in the question, which is truly unique to Knuth's description of TeX, and focus specifically on the tokens that the question asks about:

I have never seen another language use this token system: it is not used by Pascal, C, PHP, SQL, LISP, etc. […] How do other computer languages manage without them?

This is not so. Just a few seconds of searching will show you that the compilers for all these languages use tokens in their implementation:

  • C tokens: “In a C source program, the basic element recognized by the compiler is the "token." A token is source-program text that the compiler does not break down into component elements.” (Similarly see C++ tokens: a token is the smallest element of a C++ program that is meaningful to the compiler.)
  • Pascal, PHP, SQL and Lisp: each language's reference documentation likewise has a section defining its tokens.

A subtle difference is that when learning these languages you may get away without encountering any mention of tokens, as they can be treated as an implementation detail of the compiler of the language. In TeX's case though, there is strictly speaking no “language” (e.g. a language standard with multiple compilers written for that language): there is a single system, the TeX program, which happens to be written like a compiler.

The TeXbook doesn't use the word “compiler” anywhere, but TeX: The Program (invoke texdoc tex to read it) starts with

1. Introduction. This is TeX, a document compiler intended to produce typesetting of high quality.

So Knuth conceived of TeX as a “document compiler” at least when writing the program. Remember that Knuth's programming background was in writing compilers:

  • as a student at Case Institute of Technology in the late 50s he had co-written a compiler called RUNCIBLE (his second-ever publication, after the one in MAD Magazine) (watch a talk about it starting here),
  • on the strength of this he got a job as a consultant at Burroughs (while still a student) to write an Algol 58 compiler (read Richard Waychoff's account of it in “III. The Summer Of 1960 (Time Spent with don knuth)”),
  • …and so on, until in 1962 he was approached by Addison-Wesley to write a book on compilers, which morphed into his (ongoing) life's work, The Art of Computer Programming.
  • In 1977 when he wanted to write TeX, many of the precursors such as PUB (see Knuth's note PUB and pre-TeX history in TUGboat) were written as (or called themselves) compilers: the PUB manual is titled “PUB: The Document Compiler”.

So it's natural that TeX is written like a compiler.

And the overall plan of execution—read characters into tokens (syntax), then turn them into commands (semantics)—is (part of) how most compilers are written. In fact, the terms syntax and semantics aren't even restricted to programming languages: they are used in linguistics, for human languages.
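
A small sketch of watching these two stages from within TeX itself: \meaning and \show report what command a token stands for (the semantics), after TeX's reading apparatus has already formed the token (the syntax). The name \greet is made up for the example; run with plain TeX:

    \def\greet{Hello}
    \message{\meaning\greet}   % writes: macro:->Hello
    \show\greet                % writes: > \greet=macro:->Hello.
    \bye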

ShreevatsaR
  • (This is arguably just context for David's answer, which points out why a TeX user is more likely to encounter the idea of tokens than a programmer of some typical language: because the way text gets turned into tokens isn't fixed in TeX, and TeX usage tends to exploit that fact. Nevertheless, I thought it worth pointing out that the idea of tokens itself is not unique to TeX. Even tokenization being dynamic isn't unique to TeX; you can do some similar things in the C preprocessor and in Lisp-family languages, as Aditya pointed out with the example of Racket.) – ShreevatsaR Aug 02 '17 at 18:34
  • Thank you for these remarkable, historically accurate explanations. I wanted to ask the same question to Donald Knuth himself, but he did not reply to the mails; it was his administrator Maggie A McLoughlin who answered that all these questions are answered in his book Digital Typography. I do not know whether I failed to translate the question correctly into English, or whether there really is an answer to this question in that book. – AndréC Aug 02 '17 at 18:46
  • @AndréC Between David's answer and mine, is any part of your question still unanswered? As I tried to make clear, the answer to “Why does TeX need tokens?” is simply that TeX has tokens because that's what all languages do, and it's the natural way to write compilers. The book Digital Typography does have a lot of context on the history of Knuth's work on TeX, including his first two draft memos about TeX. But really, if your question is about tokens in general then you might be better off reading a book on compilers, as this is not specific to TeX. – ShreevatsaR Aug 02 '17 at 20:38
  • The question is why Donald Knuth gave TeX the ability to play with tokens. I cannot believe it is accidental. – AndréC Aug 02 '17 at 21:31
  • @AndréC Unless you play with category codes (and most users of TeX get by without ever using them), then the fact that tokenization can be changed doesn't really come into the picture. For example, nothing that you mentioned in the question hints at it; everything in your question is basically how all languages work. It's only David in the middle paragraph of his answer who brought that up. Is that really what your question is about? (The answer to that is straightforward: ASCII wasn't standardized at the time so to be usable on different keyboards, TeX couldn't hardcode character codes.) – ShreevatsaR Aug 02 '17 at 21:56
  • All compilers use tokens, but in almost all languages they cannot be modified directly: they are fixed once and for all. As you pointed out, Donald Knuth had a perfect knowledge of compilers and programming languages, which suggests that he made it possible to act on tokens intentionally, for a specific purpose. TikZ is written in TeX and yet, like many programming languages and unlike TeX, its syntax ends instructions with a semicolon. So I now wonder whether Donald Knuth allowed acting on tokens precisely to enable software layers such as the LaTeX packages. – AndréC Aug 03 '17 at 05:41
  • @AndréC No he did not. Evidence: the first two draft TeX memos say nothing about catcodes. LaTeX does not really depend on catcode changes except as a convenience (using @ in “internal” names). The fact that TikZ can use semicolons specially is simply a consequence of using macros with pattern matching (see the sketch after this thread). Also, Knuth would consider things like TikZ and even LaTeX to be weird, as he thinks it would be simpler to change the program itself (he has expressed surprise on multiple occasions about such usage of TeX). You may read: – ShreevatsaR Aug 03 '17 at 06:00
  • I am amazed that D. Knuth is surprised at the current use of LaTeX, since it can be said that, in some way, LaTeX modifies TeX while allowing TeX itself to remain stable over time. – AndréC Aug 03 '17 at 07:49
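
To illustrate the remark about TikZ and pattern matching: a minimal plain-TeX sketch of a macro whose argument is delimited by a semicolon (\mycommand is a made-up name for the example); no catcode trickery is involved:

    \def\mycommand#1;{\message{got: #1}}
    \mycommand draw (0,0) -- (1,1);   % everything up to the ; becomes #1
    \bye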