
For some reason, I need to choose one or two characters to delimit placeholders in LaTeX3 strings, like:

I like ``NAME`OF`FRUITS``

in order to allow safer replacement (hence the doubling of the first and last symbol, to remove ambiguity) and easy reading.

But finding a good character is not that easy:

  • Many characters have special meanings for LaTeX, and babel redefines many characters, like ^, _, -, ;, ~, #, @… I’d prefer not to use them, to avoid escape nightmares. I don’t know all packages, and I’m sure that some popular packages redefine other characters.
  • I’m thinking that non-ascii characters might be an issue since they might be interpreted differently on different computers. In particular, I don’t know if ° could be a valid choice.

Is there a list of characters that are basically as safe to use as letters?

EDIT

Here is a more concrete use case I want to consider. I basically want to be sure that no matter what is around \robExtGetPlaceholder{__VEGETABLE__}, the _ should not be turned into another symbol, even after loading a popular package.

\documentclass{article}
\ExplSyntaxOn

\seq_clear_new:N \l_robExt_placeholders_seq

% Make sure that the placeholder is in the list \l_robExt_placeholders_seq.
% This should automatically be called by other tools
\NewDocumentCommand{\robExtAddPlaceholderToList}{m}{
  \seq_put_left:Nn \l_robExt_placeholders_seq { #1 }
}

\NewDocumentCommand{\robExtPlaceholderFromContent}{mm}{
  \str_gset:cn { l_robExt_placeholder_#1_str } {#2}
  \message{aaaaaaaaaaaaa#1}
  \robExtAddPlaceholderToList{#1}
}

\NewDocumentCommand{\robExtDebugPlaceholder}{sm}{
  \message{Placeholder ~ #2 ~ contains: ~ \use:c{l_robExt_placeholder_#2_str}}
  \IfBooleanTF{#1}{\cs_show:c { l_robExt_placeholder_#2_str }}{}
}

\NewDocumentCommand{\robExtGetPlaceholder}{m}{
  \use:c{l_robExt_placeholder_#1_str}
}

\NewDocumentCommand{\robExtDebugPlaceholdersContents}{s}{
  \message{List ~ of ~ placeholders:}
  \seq_map_inline:Nn \l_robExt_placeholders_seq {\robExtDebugPlaceholder{##1}}
  \IfBooleanTF{#1}{\cs_show:N \l_robExt_placeholders_seq}{}
}

%% I also have other commands, for instance to replace placeholders etc...

\ExplSyntaxOff
\begin{document}

\robExtPlaceholderFromContent{FRUIT}{Orange}
\robExtPlaceholderFromContent{SENTENCE}{I like FRUIT and VEGETABLE}

$\robExtPlaceholderFromContent{VEGETABLE}{Salad}$

\robExtDebugPlaceholdersContents*

Does it mean that whatever is put around the get-placeholder call here, it will still be interpreted correctly (no escaping, no weird replacement of the character with another character…)?

$1 + \robExtGetPlaceholder{FRUIT} + \robExtGetPlaceholder{VEGETABLE}$

\end{document}

tobiasBora
  • I'm not sure what you are actually trying to achieve, but this may be related and, perhaps, useful: https://tex.stackexchange.com/a/472729/105447 – gusbrs Jul 10 '23 at 01:04
  • If you are compiling under UTF-8, try using ♪ (which is called by Alt+13). Most TeX/LaTeX compilers can parse it since it's part of an ASCII table and is not used by anything as far as I'm aware. There are other ASCII characters like that as the arrows ↑↓→← –  Jul 10 '23 at 01:07
  • @MananIntindinichu musical symbols and arrows are not ASCII – David Carlisle Jul 10 '23 at 05:43
  • Your question isn't very clear. What do you mean by "delimit"? Is the character just used as a delimited-argument delimiter, as in \def\foo#1#2{...}, which does not require the character to be a single token or to have any TeX definition? Or do you require a token that has a TeX definition (and if so, can it be a csname token such as \starthere)? – David Carlisle Jul 10 '23 at 09:13
  • @DavidCarlisle So basically I want a character, ideally not too visible and easy to type like an underscore, such that I can create a macro containing that char (defined via a cs name), and I can write this char anywhere, like in \getValuePlaceholder{°°MY°FRUIT°°}, without it being interpreted in a weird way due to babel, csquotes, whatever. I want to be able to put this symbol directly in a string, and I want the strings to be strictly identical (when computing an md5sum, for instance), even if two strings are created from files with different encodings (I don't want my UTF-8 library to be broken on Windows). – tobiasBora Jul 10 '23 at 11:49
  • "the _ should not be turned into another symbol," you are using str type so everything is catcode 12 and has no definition at all. – David Carlisle Jul 10 '23 at 19:49
  • Ok good then, I was not sure since I was using str/csname… Thanks! – tobiasBora Jul 11 '23 at 00:36

2 Answers

In full generality the answer is

There is no safe character.

A slightly more helpful answer would be

Only you can say which are the safe (sequences of) characters. You can use any sequence guaranteed not to appear as data.

The bullet points in your question don't seem that relevant: babel, input encodings, etc. (mostly) only matter if you are typesetting, but here you are just delimiting substrings.
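To illustrate that point, here is a minimal sketch, assuming a recent LaTeX kernel in which \ExplSyntaxOn and \NewDocumentCommand are built in. The command name \demoSameName is made up for this example. The underscore typed in the document body has its usual subscript catcode, while the one inside the expl3 code is a letter, yet a string comparison ignores that difference:

```latex
\documentclass{article}
\ExplSyntaxOn
% \demoSameName is a hypothetical helper, not part of any package.
% \str_if_eq:nnTF compares both arguments as strings, so the category
% codes (and any definitions) of the characters play no role.
\NewDocumentCommand{\demoSameName}{m}
  { \str_if_eq:nnTF {#1} { MY_FRUIT } { same } { different } }
\ExplSyntaxOff
\begin{document}
% The underscore below is only tokenized and compared, never typeset.
\demoSameName{MY_FRUIT}
\end{document}
```

This should typeset "same". A token-list comparison such as \tl_if_eq:nnTF would be stricter, because it also compares category codes.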

You may think of ° as a degree-sign character, but to pdftex it is the two character tokens C2 B0, and it is safe to use as a delimiter only if the two bytes C2 B0 never appear in the delimited string. If you double it, the delimiter is the four tokens C2 B0 C2 B0 and the answer is the same: this is safe to use as a delimiter so long as the four bytes C2 B0 C2 B0 do not appear in the data to be delimited.
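A quick way to see this for yourself is sketched below, assuming a UTF-8 source file and a recent LaTeX kernel. The command name \demoCountChars is invented for the illustration; \str_count:n is the expl3 function that counts the characters of its argument after converting it to a string.

```latex
\documentclass{article}
\ExplSyntaxOn
% \demoCountChars is a hypothetical helper: it typesets the number of
% characters the engine sees in its argument.
\NewDocumentCommand{\demoCountChars}{m}
  { \str_count:n {#1} }
\ExplSyntaxOff
\begin{document}
Length of the degree sign: \demoCountChars{°}
\end{document}
```

With pdflatex this should report 2 (the bytes C2 and B0), while with lualatex or xelatex, which read whole Unicode characters, it should report 1.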

The same is true of higher characters: to pdftex 🦆 is not a duck but the four bytes F0 9F A6 86, and you can use 🦆...🦆 as the delimiter so long as those four bytes do not appear in ...

You can also use multiple bytes of printable ascii, eg [[....]] delimits ... safely so long as you ensure ... never includes [[ or ]].
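As a rough sketch of the ASCII approach, assuming a recent LaTeX kernel: the variable \l_demo_sentence_str, the command \demoReplace, and the placeholder names are all hypothetical and chosen only for illustration. It uses the expl3 function \str_replace_all:Nnn to substitute [[...]] placeholders in a string.

```latex
\documentclass{article}
\ExplSyntaxOn
% Hypothetical names, for illustration only.
\str_new:N \l_demo_sentence_str
\NewDocumentCommand{\demoReplace}{}
  {
    % In expl3 syntax, ~ stands for a space.
    \str_set:Nn \l_demo_sentence_str
      { I~like~[[FRUIT]]~and~[[MY_VEGETABLE]] }
    % Each placeholder is matched as a plain sequence of characters.
    \str_replace_all:Nnn \l_demo_sentence_str { [[FRUIT]] } { Orange }
    \str_replace_all:Nnn \l_demo_sentence_str { [[MY_VEGETABLE]] } { Salad }
    \str_use:N \l_demo_sentence_str
  }
\ExplSyntaxOff
\begin{document}
\demoReplace
\end{document}
```

This should typeset "I like Orange and Salad". The replacement only stays unambiguous as long as the data itself never contains [[ or ]].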

The details here are for classic 8bit tex such as latex or pdflatex, but the main point of the answer would be the same with xelatex or lualatex.

David Carlisle
  • Thanks! So basically, I don’t care so much about the fact that this may appear in my data. My main fear is to see LaTeX either 1) change the definition of a character to something else (I had this problem with tikz and babel that redefine ; and broke tikz), since it certainly depends on how I use it, I added in my question the kind of code that I am interested in 2) that if two files have a different encoding (e.g. latin 1 instead of utf8), then my character might change between the two files (I don’t want this). So with your example, would ° still be C2B0 in a latin1-encoded file? – tobiasBora Jul 10 '23 at 18:01
  • Why do you care if LaTeX changes the definition, or even if it has a definition? If you typeset it you will get an error that it is not set up, but that is not relevant if you are just using it as a delimiter. ° would have a different byte sequence, or not be encodable at all, in another encoding, but you can use the token pair C2B0 in an encoding – David Carlisle Jul 10 '23 at 18:22
  • So I am writing a library in a UTF-8 file. If I write ° there, for instance, it will be encoded as C2 B0. But if a user/co-author/… uses a different encoding than mine, ° might be encoded as B0 (latin1). This means that if it says somewhere "please insert placeholder °FRUIT", it will be translated to B0something. But if my library said °FRUIT should be equal to banana, then LaTeX will translate this, I guess, to C2B0something. So the \str_replace_all macro will not replace °FRUIT with banana, as C2B0something is different from B0something. Do you see my issue? – tobiasBora Jul 10 '23 at 18:38
  • @tobiasBora You can document that the delimiter is C2B0 rather than °, but if you are worried about that, use ASCII, e.g. [[fruit]]. Really, though, that is nothing to do with LaTeX strings and all about encodings. If you worry about that you have more than delimiters to worry about: °FRUIT will lose FRUIT if you use an encoding that only has (say) Cyrillic and no Latin script – David Carlisle Jul 10 '23 at 18:52
  • OK, thanks! I don't want my users to type weird binary stuff, so I will rather stick to ASCII (and if the encoding cannot represent ASCII, then the user should have trouble running LaTeX anyway, since no macro exists in Cyrillic as far as I know ^^). Good to know that [] is safe, that is a good option. I would also like to represent spaces somehow (for visibility reasons the placeholder should be in upper-case letters). Do you know if _ is safe to use in that context, like [[MY_FRUIT]]? I was expecting it to fail in math expressions, but it seems to work for now; I am not sure how robust it is. – tobiasBora Jul 10 '23 at 19:28
  • Btw, if _ is safe, I’d be curious to understand why >/… + babel was causing some troubles to tikz (see e.g. https://tex.stackexchange.com/questions/166772/problem-with-babel-and-tikz-using-draw). – tobiasBora Jul 10 '23 at 19:31
  • @tobiasBora I didn't say [[ is safe, I said you could say it was safe. The tikz problem is about catcodes, which isn't a problem if you use l3 str, where all catcodes are normalized. If you allow different catcodes you may need to be more careful how you match – David Carlisle Jul 10 '23 at 19:35
  • Ahah thanks. hope that _ does not come with weird surprises, but in the worst case I guess it is possible for the user to define placeholder alias at the beginning of the file, removing annoying characters. Thanks! (PS: I’m always really surprised by how complicated LaTeX is… and why LaTeX3 has not tried to make it a bit more accessible, using a python/lua/… like syntax… but it is another question!) – tobiasBora Jul 10 '23 at 19:39

Edit

Turns out I was wrong in assuming that some characters accessible via the Alt key were ASCII. While this workaround "works", it may introduce a plethora of unseen problems. Thanks to David Carlisle for the clarification.

Original answer

Out of curiosity, I tested how far classic ASCII characters can be pushed to be used as functions or potentially escape characters as what you need, and turns out they run! At least in this example you can see something that may be close to your needs:

\documentclass{article}

\newcount\newcase
\newcase=0
\def♪{%
  \ifnum\newcase=0
    \newcase=1 [
  \else
    \newcase=0 ]
  \fi
}

\begin{document}
Musical math: ♪1+1=2♪
\end{document}

At least, this works on LaTeX while running under UTF-8. There is a caveat, though: only one “special character” can be defined at a time. For instance, doing something like this:

\def♪{A thing}
\def♫{Another thing}

will cause a “Use of � doesn't match its definition” error. I'm not entirely sure of the reason, but at least this seems to work for one “odd” definition. And in that one “odd” definition you can go wild with characters like ☺☻♪♫←↓→↑§ and more!

Hope that helps!

  • Sorry, this is completely wrong. The characters are not ASCII and are multi-byte UTF-8 sequences. Using \def that way breaks LaTeX's UTF-8 support, as you show – David Carlisle Jul 10 '23 at 05:46
  • ♪ is not ASCII (which has no character codes above 127); it is U+266A, and to pdftex it is three characters: E2 99 AA. Your \def just redefines byte E2, so it destroys every character whose UTF-8 encoding starts with E2 – David Carlisle Jul 10 '23 at 09:18
  • @DavidCarlisle It's not? I assumed it was because old systems used it and it's accessible via Alt+# by default. I'll edit the answer to prevent a misunderstanding. Thanks for the clarification! –  Jul 10 '23 at 18:04
  • this answer even after the edit is wrong, you should not suggest people break latex like this. – David Carlisle Jul 10 '23 at 18:17