
In order to (re-)write a boilerplate package (i.e. one containing a lot of text), I see two approaches.

a) Have the texts in code and use a deeply nested if-then-else structure that is evaluated every time.

b) Load all the texts into an expl3 data structure and then access them via an index.

Approach a) is proven to work, but needs a lot of evaluation time, and the code looks a bit awkward. Approach b) would be more modern code, but might exceed some memory limits.

Would b) work? I would like to hear a recommendation from an expl3 expert on which way to go.

Update, to answer the additional questions:

This is to update the hpstatement.sty package, containing statements of the Globally Harmonized System of Classification and Labelling of Chemicals.

It will contain roughly 300 phrases per language. For a maximum of 24 languages, this means about 7500 phrases and 500 kB.

mhchem
  • pdfTeX and XeTeX have 2.5 MB of memory for definitions (source), so unless you are saving really huge pieces of text, memory shouldn't be a problem. – Phelype Oleinik Jun 07 '19 at 20:08
  • your question isn't really answerable in this form, in what way is lookup parametrised? why do you need conditionals? (I wouldn't use ifthenelse it's slow and clunky) why can't you just use a separate macro for each text? – David Carlisle Jun 07 '19 at 20:09
  • How about putting the text chunks in individual files and only input the necessary ones? – Skillmon Jun 07 '19 at 20:09
  • I don't quite understand (a), and exactly how much text? I don't see any reason not to use macros to hold this. I often just use the normal hash table. Hacking it in expl3 should not add that much overhead. Though you should probably explain in a bit more detail what you are attempting, this is a bit vague – daleif Jun 07 '19 at 20:11
  • also what do you mean by memory or code here? either way you are just storing the data in tex's main token memory aren't you? – David Carlisle Jun 07 '19 at 20:13
  • The 164 paragraphs of kantlipsum are stored in an expl3 sequence, totaling 147957 bytes. In your case, a sequence for each language doesn't seem too big. – egreg Jun 07 '19 at 20:17
  • 1
    it would all fit in memory easily enough but i would guess that most documents only need one or two languages so you may prefer to use a file per language, there are no need for if tests really as you can build the command names directly from the arguments I would guess. – David Carlisle Jun 07 '19 at 20:22
  • Building command names from arguments? That sounds good. It's so TeX'y that I did not think of that. – mhchem Jun 07 '19 at 20:31
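The "build command names from arguments" idea from the comments can be sketched as follows. This is a minimal, hypothetical illustration (the function and variable names, and the sample phrase, are my own, not from hpstatement.sty): each phrase lives in its own control sequence whose name is built from the language and the statement code, so lookup needs no conditionals at all.

```latex
\ExplSyntaxOn
% Hypothetical sketch: store each phrase under a csname built from the
% language tag (#1) and the statement code (#2); #3 is the phrase text.
\cs_new_protected:Npn \hp_store:nnn #1#2#3
  {
    \tl_clear_new:c { l__hp_ #1 _ #2 _tl }
    \tl_set:cn { l__hp_ #1 _ #2 _tl } {#3}
  }

% Retrieve a phrase by rebuilding the same csname -- no if-then-else.
\cs_new:Npn \hp_use:nn #1#2
  { \tl_use:c { l__hp_ #1 _ #2 _tl } }

\hp_store:nnn { en } { H200 } { Unstable~explosive. }
% \hp_use:nn { en } { H200 } would then expand to the stored phrase.
\ExplSyntaxOff
```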

1 Answer


The 164 paragraphs in kantlipsum are stored in an expl3 sequence, with a total size of 147957 bytes.

I tried concatenating five copies of it, getting a sequence with 820 items and a total size of 739785 bytes.

The impact on memory is

 9810 strings out of 492609
 192626 string characters out of 6129049
 3060841 words of memory out of 5000000
 13800 multiletter control sequences out of 15000+600000

Just loading expl3 shows

 9774 strings out of 492609
 191933 string characters out of 6129049
 210796 words of memory out of 5000000
 13768 multiletter control sequences out of 15000+600000

You can probably load the strings for a language on demand, by storing them in separate files, so the impact would not be so big: about 21 kiB per language. Having separate files for each language would also ease maintenance.
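The on-demand loading could look roughly like this. The file naming scheme (`hpstatement-<lang>.def`) and the function name are assumptions for illustration; the guard boolean ensures each language file is read at most once.

```latex
\ExplSyntaxOn
% Hypothetical sketch: input a language's phrase file only on first use.
% File names such as hpstatement-en.def are illustrative, not prescribed.
\cs_new_protected:Npn \hp_load_language:n #1
  {
    \bool_if_exist:cF { l__hp_loaded_ #1 _bool }
      {
        \bool_new:c { l__hp_loaded_ #1 _bool }
        \file_input:n { hpstatement- #1 .def }
      }
  }
\ExplSyntaxOff
```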

Accessing 300-item sequences is not so fast, but you can use control sequence names (csnames) instead.

Here's a comparison: I first define a 300-item sequence, then 300 token lists; then I benchmark accessing random items.

\documentclass{article}
\usepackage{xparse}
%\usepackage{kantlipsum}
\usepackage{l3benchmark}

\ExplSyntaxOn

\int_step_inline:nn { 300 } { \seq_put_right:Nn \l_tmpa_seq { #1 } }

\int_step_inline:nn { 300 }
 {
  \tl_new:c { l_test_#1_tl }
  \tl_set:cn { l_test_#1_tl } { #1 }
 }

\begin{document}

\benchmark:n { \seq_item:Nn \l_tmpa_seq { \int_rand:n { 300 } } }

\benchmark:n { \tl_use:c { l_test_\int_rand:n { 300 }_tl } }

\ExplSyntaxOff

\end{document}

The result is

3.47e-4 seconds (1.08e3 ops)
5.6e-6 seconds (17.5 ops)

so there is roughly a factor of 60 in favor of the second method. Accessing the last item in the sequence takes essentially the same time as accessing a random one. For the second method, accessing one item or another takes the same time.

egreg