
I am using the datatool package to read some CSV files. I've used the csvsimple package before, which I finally managed to break by adding more and more requirements to the mix. Now I'd like to switch to the datatool package, load the "databases" once and use them repeatedly. The problem is: Loading the files takes forever. The files have about 1000 lines each with 13, 6 and 8 columns respectively, and loading the first file alone takes way more than five minutes. If I turn on verbose logging of the datatool package, I can see the load process getting slower and slower the more data is read. Is this just the way it is (which would mean I'll have to find another alternative) or can I do anything to speed up the loading process?

vwegert
  • Maybe file a bug report against datatool? I also have performance issues with datatool - see http://tex.stackexchange.com/q/225438/3406. – Faheem Mitha Jan 29 '15 at 00:12
  • Can you provide code for a sample usage? It is one thing to confirm the problem with the present approach, but the development of alternatives generally requires a more specific knowledge of the application. – Steven B. Segletes Feb 12 '15 at 12:27

1 Answer


csvsimple is described as a lightweight tool that doesn't support sorting or data storage, whereas datatool is a heavyweight package that supports both sorting and storage, and it's the extra functionality that makes datatool slower than csvsimple.

The issue basically comes down to the fact that although TeX is Turing complete, it's a document formatting language rather than a general-purpose programming language, and it doesn't have array-like structures in which to conveniently store multi-dimensional data. In most languages you can do something like float[][] array = new float[m][n], access an entry with array[i][j], and set an entry with array[i][j] = value.
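What TeX does offer is the ability to build control-sequence names on the fly with \csname, which is the usual idiom for faking array-style lookup. This is only a sketch of the general technique, not datatool's actual internals:

```latex
% Sketch only: emulating "array[2][3] = 42" with a dynamically
% named macro whose name encodes the indices.
\expandafter\def\csname cell@2@3\endcsname{42}
% "print array[2][3]": the macro expands to 42.
\csname cell@2@3\endcsname
```

Every lookup and assignment goes through name construction like this, which is one reason data-heavy TeX code is slow compared to real array access.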

In TeX, however, there are essentially two ways of storing data: in a macro or in a register. The first version of datatool used a macro for storage, but the internal structure was changed in datatool version 2.0 thanks to Morten Høgholm, who provided a much better method of storing the data within a token register. This significantly improved data lookup, but since the entire database is stored within a single token register, changing the data (rather than simply looking it up) requires splitting the contents of the register around the entry that needs changing, recombining the whole thing with the new value, and resetting the contents of the register. When loading a database from a CSV file, this has to be done for every cell in the database. (Other things are done as well, such as determining the data type of each column.)
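The cost of this can be seen in the standard idiom for appending to a token register: the entire existing contents must be expanded and the register reassigned, so the work per cell grows with the size of the database. A rough sketch (again, not datatool's actual code — \dbrow is a hypothetical row marker):

```latex
% A token register holding two "rows" of data.
\newtoks\mydb
\mydb={\dbrow{a}\dbrow{b}}
% To add a row, the whole register is expanded and reset:
\mydb=\expandafter{\the\mydb \dbrow{c}}
```

Doing this once per cell over thousands of cells is why the load visibly slows down as more data is read.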

Verbose mode can also add to the load time. It's off by default, but for big databases with verbose mode off it can look as though nothing's happening.
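For reference, verbose logging is controlled through a package option (the exact option name here is my assumption — check the datatool manual to be sure):

```latex
% Assumed option name: enable verbose logging during database loading.
\usepackage[verbose]{datatool}
```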

Here's an example document that simply loads data from a file called entries.csv:

\documentclass{article}

\usepackage{datatool}

\DTLloaddb{entries}{entries.csv}

\begin{document}

\end{document}

I created entries.csv with a single column of 100 randomly generated words. Including the header row, this file has 101 lines. Then I compiled this test document using the Linux time command:

time pdflatex test.tex

This produced:

real   0m0.177s
user   0m0.163s
sys    0m0.012s

With verbose mode switched on this changed to:

real    0m0.184s
user    0m0.163s
sys     0m0.018s

Then I retried with 10 columns and 100 rows (plus the header row) and the results were:

verbose off:

real    0m1.072s
user    0m1.059s
sys     0m0.012s

verbose on:

real    0m1.121s
user    0m1.100s
sys     0m0.014s

Then again with 10 columns and 10,000 rows (plus the header row):

verbose off:

real    1m44.484s
user    1m44.488s
sys     0m0.017s

verbose on:

real    1m47.569s
user    1m47.460s
sys     0m0.064s

So switching off the verbose mode can shave off a few seconds, but a far more significant improvement comes from making TeX set the contents of the token register only once rather than repeatedly resetting it. That can be done by running datatooltk before compiling the document: it's a Java application that can rapidly load and sort data from CSV, XLS or ODS files or imported from a MySQL database.

Now

time datatooltk --csv entries.csv --output entries.dbtex

produces:

real    0m0.453s
user    0m0.826s
sys     0m0.038s

The test document only requires a minor modification:

\documentclass{article}

\usepackage{datatool}

\DTLloaddbtex{\entriesdb}{entries.dbtex}

\begin{document}

\end{document}

(The data can now be referenced in commands such as \DTLforeach using the first argument of \DTLloaddbtex, \entriesdb in this case.)
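For instance, assuming the CSV's header row named its single column Word (a hypothetical column key for this example), the rows could be printed with:

```latex
% Iterate read-only over the database loaded by \DTLloaddbtex.
% "Word" is a hypothetical column key from the CSV header row;
% \theword receives each row's value in turn.
\DTLforeach*{\entriesdb}{\theword=Word}{\theword\par}
```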

Now the compile time is:

real    0m0.180s
user    0m0.160s
sys     0m0.018s

Another way of speeding up document compilation is to use the read-only \DTLforeach* whenever you want to iterate through the data without modifying it.

For example, using the read-write (unstarred) version:

\documentclass{article}

\usepackage{datatool}

\DTLloaddbtex{\entriesdb}{entries.dbtex}

\begin{document}

\DTLforeach{\entriesdb}{}{}

\end{document}

This takes:

real    0m11.000s
user    0m10.983s
sys     0m0.016s

Using the read-only (starred version):

\documentclass{article}

\usepackage{datatool}

\DTLloaddbtex{\entriesdb}{entries.dbtex}

\begin{document}

\DTLforeach*{\entriesdb}{}{}

\end{document}

This takes:

real    0m1.541s
user    0m1.527s
sys     0m0.012s

In summary, to speed up loading data with datatool:

  1. Make sure you have at least version 2.0 (although that's now 6 years old).
  2. Use datatooltk to first parse the data (and, if required, sort or filter it) and create a file that sets up the internal macros and registers that are used by datatool to store the data.

Pros and cons of datatooltk versus \DTLloaddb:

  1. With \DTLloaddb, any LaTeX commands within the CSV file need a single backslash. With datatooltk, a double backslash is required.
  2. \DTLloaddb is much slower than datatooltk.
  3. \DTLloaddb can't load CSV files where an entry contains a line break, but datatooltk can.
  4. \DTLloaddb can only load text files (comma-separated, tab-separated, etc.); datatooltk can also load XLS and ODS files and pull data from MySQL.
Nicola Talbot