accessing data using datatool when headers aren't available

Question

I have a series of csv files I'm expecting to receive on a regular basis, and there are two problems with these files.

I am not guaranteed to have headers.
I cannot rely on how many data entries there will be per row (sometimes there may be 50 or so, sometimes more)

What I am guaranteed is that, let's say, the 1st, 10th, and 30th column have the data I want. Up until now I've been using a perl script to collapse these csv files to something where the rows have only the elements I'm interested in.

Is there a way in datatool to tell it to create headers for certain columns and ignore the rest?

For example, it would be nice to be able to do something like the following:

\DTLloaddb[noheader,headers={modelname=1,serialnumber=10,weight=30}]{itemweights}{somefile.csv}

which would mean that modelname is in the first column, serialnumber is in the 10th column and weight is in column 30.

Here is an extremely simplified example of a csv file I might work with.

  "toaster",    "",      "",       "",       "234-02015-23-948",  "",  "",      "",      "",      "",  "1255",  "",  "" 
  "blender",    "junk",  "",       "",       "753-20842-46-824",  "",  "",      "junk",  "",      "",  "1567",  "" 
  "microwave",  "",      "",       "stuff",  "376-73178-68-487",  "",  "",      "",      "junk",  "",  "2434",  "",  "",  ""
  "radio",      "",      "stuff",  "junk",   "724-86499-61-446",  "",  "",      "",      "junk",  "",  "2343" 
  "tv",         "",      "",       "",       "423-90219-60-149",  "",  "junk",  "",      "",      "",  "3023",  "",  "",  ""

A solution that creates a header for every column is completely infeasible because I have upwards of 50 columns only 3 or 4 of which I'm interested in.

What I'd like to be able to do is write:

\DTLloaddb[noheader,headers={appliance=1,serialnumber=5,detailnumber=11}]{itemweights}{somefile.csv}

When you say "not guaranteed to have headers" does that mean sometimes you do and sometimes you don't? If so, I assume you can tell by inspection if the first row is or isn't a header? — Alan Munn, Jan 21 '13 at 16:33
If datatool doesn't turn out to be feasible pgfplotstable can be an alternative which can handle this. — percusse, Jan 21 '13 at 16:35
@alanmunn. Sometimes there are are headers, but only certain columns have them and many columns have duplicate names suchs comment, but most columns lack a header even when others have them. — A.Ellett, Jan 21 '13 at 16:39
So that means there is always a header row, even if every column doesn't make use of it. This is an important distinction. :) — Alan Munn, Jan 21 '13 at 16:49
@alanmunn. But datatool in such examples, if I tell datatool there is a header, it complains about duplicate header names. — A.Ellett, Jan 21 '13 at 16:51
Yes, I understand, so what you need to do is treat it as noheader and then deal with the first row manually, plus ignore columns. — Alan Munn, Jan 21 '13 at 16:54
@alanmunn. I can distinguish myself when there is and is not a header. I've been using perl to do that. What I would like to be able to do it skip perl. To clarify, occasionally, the data file completely lacks headers. When the data file does include headers, they are only even partially there and not guaranteed to be unique. — A.Ellett, Jan 21 '13 at 16:55
@percusse. Thank you for suggesting pgfplotstable. I hadn't known about it. I'm a bit reticent to use it because I continually find the various documentation for pgf stuff frustrating: I find I often have to read huge sections of the documentation to get the information I need. My job doesn't allow me the luxury of reading a manual cover to cover to learn some very cool tricks. I like what the pgf stuff can do, but the cost (cost of time) to do something simple is often immense. — A.Ellett, Jan 21 '13 at 17:03
@alanmunn, per your last comment, which I missed while writing mine, that's exactly what I want to do. But, I thought that's also what I was trying to illustrate in the question when I put noheader in the \DTLloaddb. — A.Ellett, Jan 21 '13 at 17:06
@A.Ellett Indeed it requires some affinity to key value system. Please take my suggestion as a backup if this turns out to be infeasible. — percusse, Jan 21 '13 at 17:14

score 3 · Accepted Answer · edited Jun 12 '17 at 06:53

Unless I'm misunderstanding something, this isn't too difficult to do, although perhaps not in exactly the way you envision. When datatool reads in a file, and you don't specify the headers, it will automatically assign headers of Column1 ... Column-n to your database. If you specify noheader then you don't need to worry about duplicate column names. You can use this to pick columns selectively from the database when you iterate through. In the example below, I'm assuming the file has a header row but it contains duplicates or blanks in some columns. If the file sometimes truly has no header, then the example would need to be modified to deal with that case as well.

\documentclass{article}
\begin{filecontents}{\jobname.csv}
1,,,Duplicate,5,Duplicate,7,8,9,10
Cell-1-2,Cell-2-2,Cell-3-2,Cell-4-2,Cell-5-2,Cell-6-2,Cell-7-2,Cell-8-2,Cell-9-2,Cell-10-2
Cell-1-3,Cell-2-3,Cell-3-3,Cell-4-3,Cell-5-3,Cell-6-3,Cell-7-3,Cell-8-3,Cell-9-3,Cell-10-3
Cell-1-4,Cell-2-4,Cell-3-4,Cell-4-4,Cell-5-4,Cell-6-4,Cell-7-4,Cell-8-4,Cell-9-4,Cell-10-4
\end{filecontents}
\usepackage{datatool}
\DTLloaddb[noheader]{testdb}{\jobname.csv}
\begin{document}
\begin{tabular}{lll}
ONE & SIX & NINE
\DTLforeach{testdb}{\one=Column1,\six=Column6,\nine=Column9}{%
\DTLiffirstrow{\\}{%
\one & \six & \nine\\
}}
\end{tabular}
\end{document}

output of code

Thank you very much. I can't find this approach in the manual. How did you know to do this? — A.Ellett, Jan 21 '13 at 18:00
Section 5.2 (p. 39 or thereabouts, depending on your version) of the manual discusses this in the description of the "keys" option — Alan Munn, Jan 21 '13 at 18:03
Ah ha! I'd seen that but I hadn't understood how to use it. Thanks all over again! — A.Ellett, Jan 21 '13 at 18:11

accessing data using datatool when headers aren't available

1 Answers1