How to look up a cell value of a multi-column csv file?

Question

Following up on this answer, how can I give the command \getValue an optional argument that specifies the column in which to look up the cell value, so that it has the following syntax?

\getValue\<macro name>%
[<optional column number with default of 2, and the counting starts with 1 at the very first column>]%
{<search phrase>}%
{<data of the csv file>}

For example, \getValue\SndParThrd[4]{Second Parameter}{mydata} (after tweaking the following code) should store 44 in \SndParThrd.

Also, \getValue\SndParFrst{Second Parameter}{mydata} should store 3 in \SndParFrst.

\begin{filecontents*}{test.csv}
Third Parameter  , 7 , 9  , 
First Parameter  , 5 , 10 ,
Second Parameter , 3 , 6  , 44
\end{filecontents*}
\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn
% Step 1: reading the file
\ior_new:N \l__diaa_csv_ior
\bool_new:N \l__diaa_csv_str_bool
\NewDocumentCommand \ReadCSV { s m >{ \SplitArgument {1} {,} }O{ 1,2 } m }
  {
    \IfBooleanTF {#1}
      { \bool_set_true:N \l__diaa_csv_str_bool }
      { \bool_set_false:N \l__diaa_csv_str_bool }
    \diaa_csv_read:nnnn {#2} #3 {#4}
  }
\cs_new_protected:Npn \diaa_csv_read:nnnn #1 #2 #3 #4
  {
    \prop_new:c { g__diaa_csv_#1_prop }
    \ior_open:NnTF \l__diaa_csv_ior {#4}
      { \__diaa_csv_read:cnn { g__diaa_csv_#1_prop } {#2} {#3} }
      { \msg_error:nnn { diaa } { file-not-found } {#4} }
  }
\msg_new:nnn { diaa } { file-not-found }
  { File~`#1'~not~found. }
\cs_new_protected:Npn \__diaa_csv_read:Nnn #1 #2 #3
  {
    \bool_if:NTF \l__diaa_csv_str_bool
      { \ior_str_map_inline:Nn }
      { \ior_map_inline:Nn }
        \l__diaa_csv_ior
        {
          \prop_put:Nxx #1
            { \clist_item:nn {##1} {#2} }
            { \clist_item:nn {##1} {#3} }
        }
  }
\cs_generate_variant:Nn \__diaa_csv_read:Nnn { c }
%
% Step 2: getting the values
\NewDocumentCommand \getValue { m m m }
  { \tl_set:Nx #1 { \diaa_csv_item:nn {#2} {#3} } }
\NewExpandableDocumentCommand \CSVItem { m m }
  { \diaa_csv_item:nn {#1} {#2} }
\cs_new:Npn \diaa_csv_item:nn #1 #2
  { \prop_item:cn { g__diaa_csv_#2_prop } {#1} }
\ExplSyntaxOff
\begin{document}
\ReadCSV{mydata}{test.csv}
\getValue\rdPar{Third Parameter}{mydata}
\rdPar
\edef\stPar{\CSVItem{First Parameter}{mydata}}%
\stPar
\end{document}

frougon · Accepted Answer · 2020-09-16T15:56:13.863

This is a modified version of Phelype Oleinik's answer that you linked to in the question. The first method is shorter, the second is better. :-)

First method

Column numbers in the CSV file start at 1. The new commands have the following syntaxes:

\ReadCSV[*] {label} [key column] {value columns} {file}

where key column defaults to 1 and value columns is a comma-separated list of column numbers (only values from these columns are stored by \ReadCSV to allow later retrieval by \getValue and \CSVItem).

\getValue {\macro} [value column] {key} {label}

This is what you asked for in the question. value column defaults to 2. The contents of the value column cell corresponding to the specified key are stored in \macro (which could be an expl3 token list variable).

\CSVItem {key} [value column] {label}

This command is fully expandable. It recursively expands to the contents of value column corresponding to the specified key. value column defaults to 2, as in \getValue.
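
As a quick illustration of the three syntaxes, here is a sketch of how the calls might look with the sample test.csv from the question. It assumes the First-method definitions from the full example that follows are loaded. The macro name \total and the use of \inteval (from the xfp package) are choices made for this illustration only and are not part of the answer's code.

```latex
% Fragment for the document body; assumes the First-method definitions
% and \usepackage{xfp} (for \inteval) in the preamble.
\ReadCSV{mydata}[1]{2,3,4}{test.csv} % keys in column 1; keep columns 2, 3 and 4
\getValue\SndParThrd[4]{Second Parameter}{mydata} % cell in column 4 of that row
\getValue\SndParFrst{Second Parameter}{mydata}    % default value column: 2
% \CSVItem is expandable, so it can be used in an expansion-only context
% such as an integer expression inside \edef (\total is a hypothetical name):
\edef\total{\inteval{\CSVItem{Second Parameter}{mydata}
                     + \CSVItem{Second Parameter}[4]{mydata}}}
```

Because \getValue is not expandable, it could not be used inside the \edef; that is the practical difference between the two retrieval commands.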

\begin{filecontents*}{test.csv}
Third Parameter  , 7 , 9  ,
First Parameter  , 5 , 10 ,
Second Parameter , 3 , 6  , 44
\end{filecontents*}
\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn
% Step 1: reading the file
\ior_new:N \l__diaa_csv_ior
\bool_new:N \l__diaa_csv_str_bool
\seq_new:N \l__diaa_csv_tmp_seq
% str mode (bool/star), label, key column, value columns, file
\NewDocumentCommand \ReadCSV { s m O{1} m m }
  {
    \IfBooleanTF {#1}
      { \bool_set_true:N \l__diaa_csv_str_bool }
      { \bool_set_false:N \l__diaa_csv_str_bool }
    \diaa_csv_read:nnnn {#2} {#3} {#4} {#5}
  }
% label, key column, value columns, file
\cs_new_protected:Npn \diaa_csv_read:nnnn #1 #2 #3 #4
  {
    \seq_set_split:Nnn \l__diaa_csv_tmp_seq { , } {#3}
    \seq_map_inline:Nn \l__diaa_csv_tmp_seq
      { \prop_new:c { g__diaa_csv_#1_##1_prop } }
    \ior_open:NnTF \l__diaa_csv_ior {#4}
      {
        \__diaa_csv_read:nn {#1} {#2}
        \ior_close:N \l__diaa_csv_ior
      }
      { \msg_error:nnn { diaa } { file-not-found } {#4} }
  }
\msg_new:nnn { diaa } { file-not-found }
  { File~`#1'~not~found. }
% label, key column
\cs_new_protected:Npn \__diaa_csv_read:nn #1 #2
  {
    \bool_if:NTF \l__diaa_csv_str_bool
      { \ior_str_map_inline:Nn }
      { \ior_map_inline:Nn }
        \l__diaa_csv_ior
        {
           \seq_map_inline:Nn \l__diaa_csv_tmp_seq
             {
               \prop_put:cxx { g__diaa_csv_#1_####1_prop }
                 { \clist_item:nn {##1} {#2} }
                 { \clist_item:nn {##1} {####1} }
             }
        }
  }
% Step 2: getting the values
% macro or tl var, value column, key, label
\NewDocumentCommand \getValue { m O{2} m m }
  { \tl_set:Nx #1 { \diaa_csv_item:nnn {#3} {#2} {#4} } }
% key, column containing the desired value, label
\NewExpandableDocumentCommand \CSVItem { m O{2} m }
  { \diaa_csv_item:nnn {#1} {#2} {#3} }
% key, column containing the desired value, label
\cs_new:Npn \diaa_csv_item:nnn #1 #2 #3
  { \prop_item:cn { g__diaa_csv_#3_#2_prop } {#1} }
\ExplSyntaxOff
\begin{document}
% Use default column for the key (1)
\ReadCSV{mydata}{2,4}{test.csv} % discard the third column
\getValue\rdPar[4]{Second Parameter}{mydata}
\rdPar                         % 44
\getValue\rdPar{Second Parameter}{mydata}
\rdPar                         % 3
\getValue\rdPar{Third Parameter}{mydata}
\rdPar                         % 7
\edef\rdPar{\CSVItem{First Parameter}{mydata}}%
\rdPar                         % 5
\edef\rdPar{\CSVItem{First Parameter}[4]{mydata}}%
\ifx\rdPar\empty
  \textlangle empty\textrangle
\else
  \rdPar
\fi
\end{document}

Second method

This is based on the previous code, but:

parsing of the CSV lines is now done with \seq_set_split:Nnn instead of clist functions in order not to ignore empty cells;
it is more memory-efficient (only one l3prop is used to store all selected columns);
column addressing is different: when retrieving values with \getValue or \CSVItem, column 1 is the first selected column, column 2 is the second selected column, etc. Accordingly, the default column number when retrieving values with these functions is 1 (instead of 2).
\getValue now has a star version that performs a global assignment.
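
To make the single-l3prop storage and the new column addressing concrete, the following sketch mimics what one stored entry might look like for the row Second Parameter when all three value columns are selected. The braced token list is written by hand here purely for illustration; in the actual code it is assembled by \__diaa_csv_read:nn.

```latex
% Fragment for the document body.
\ExplSyntaxOn
% Each property-list value holds the selected cells of one row as braced
% items; \tl_item:nn picks the n-th item, counting selected columns from 1.
\tl_item:nn { {3} {6} {44} } { 1 } % first selected column of the row
\tl_item:nn { {3} {6} {44} } { 3 } % third selected column of the row
\ExplSyntaxOff
```

This is why the column number passed to \getValue or \CSVItem now refers to the position among the selected columns rather than to the position in the CSV file.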

Since you want autodetection of the number of columns, I had to change the syntax of \ReadCSV as follows:

\ReadCSV[*] [key column] {label} [value columns] {file}

key column defaults to 1 as above. However, when value columns is blank, the first line of file is read to determine the number n of columns, and value columns are defined to be all columns from 2 to n (both inclusive).
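
A few possible invocations of this syntax are sketched below, assuming the Second-method definitions from this answer are loaded. The labels all, some and bycol and the macro \tmp are hypothetical names used only for this example.

```latex
% Fragment for the document body; assumes the Second-method definitions.
\ReadCSV{all}{test.csv}           % keys: column 1; values: columns 2 to n (autodetected)
\ReadCSV{some}[2,4]{test.csv}     % keys: column 1; values: columns 2 and 4 only
\ReadCSV[2]{bycol}[1,3]{test.csv} % keys: column 2; values: columns 1 and 3
% With the label "some", value column 2 is the second selected column,
% that is, column 4 of the CSV file:
\getValue\tmp[2]{Second Parameter}{some}
```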

Spaces on both sides of column-delimiter commas are ignored when reading the file, and one set of outer braces around each cell contents is removed, if any (this is illustrated by the cell containing foo, bar in the sample CSV file below).
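
The behaviour described above comes from \seq_set_split:Nnn itself. The following minimal sketch, independent of the answer's macros, shows it on a hand-written row; the row text is an assumption made for illustration, and ~ is used because literal spaces are ignored in expl3 code.

```latex
% Fragment for the document body.
\ExplSyntaxOn
\seq_set_split:Nnn \l_tmpa_seq { , } { First~Parameter~ , 5 , {foo,~bar} , }
\seq_count:N \l_tmpa_seq       % counts four cells: the trailing empty cell is kept
\seq_item:Nn \l_tmpa_seq { 3 } % third cell, with its outer braces removed
% By contrast, clist functions skip empty items, which is why the first
% method can lose track of empty cells:
\clist_count:n { a , , b }
\ExplSyntaxOff
```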

\begin{filecontents*}{test.csv}
Third Parameter  , 7 , 9          ,
First Parameter  , 5 , {foo, bar} ,
Second Parameter , 3 , 6          , 44
\end{filecontents*}
\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn
% Step 1: reading the file
\ior_new:N \l__diaa_csv_ior
\bool_new:N \l__diaa_csv_str_bool
\seq_new:N \l__diaa_csv_tmp_seq
% str mode (bool/star), key column, label, value columns, file
\NewDocumentCommand \ReadCSV { s O{1} m O{} m }
  {
    \IfBooleanTF {#1}
      { \bool_set_true:N \l__diaa_csv_str_bool }
      { \bool_set_false:N \l__diaa_csv_str_bool }
    \diaa_csv_read:nnnn {#3} {#2} {#4} {#5}
  }
% label, key column, value columns, file
\cs_new_protected:Npn \diaa_csv_read:nnnn #1 #2 #3 #4
  {
    \tl_if_blank:nTF {#3}       % Detect number of columns and use 2 to last
      {
        \ior_open:NnTF \l__diaa_csv_ior {#4}
          {
            \bool_if:NTF \l__diaa_csv_str_bool
              { \ior_str_get:NN }
              { \ior_get:NN }
              \l__diaa_csv_ior \l_tmpa_tl
            \ior_close:N \l__diaa_csv_ior
            \seq_set_split:NnV \l_tmpa_seq { , } \l_tmpa_tl
            \seq_clear:N \l__diaa_csv_tmp_seq
            \int_step_inline:nnn { 2 } { \seq_count:N \l_tmpa_seq }
              { \seq_put_right:Nn \l__diaa_csv_tmp_seq {##1} }
          }
          { \msg_error:nnn { diaa } { file-not-found } {#4} }
      }
      { \seq_set_split:Nnn \l__diaa_csv_tmp_seq { , } {#3} } % explicit columns

    \ior_open:NnTF \l__diaa_csv_ior {#4}
      {
        \prop_new:c { g__diaa_csv_#1_prop }
        \__diaa_csv_read:nn {#1} {#2}
        \ior_close:N \l__diaa_csv_ior
      }
      { \msg_error:nnn { diaa } { file-not-found } {#4} }
  }
\msg_new:nnn { diaa } { file-not-found }
  { File~`#1'~not~found. }
\cs_generate_variant:Nn \prop_put:Nnn { cxV }
% label, key column
\cs_new_protected:Npn \__diaa_csv_read:nn #1 #2
  {
    \bool_if:NTF \l__diaa_csv_str_bool
      { \ior_str_map_inline:Nn }
      { \ior_map_inline:Nn }
        \l__diaa_csv_ior
        {
          \seq_set_split:Nnn \l_tmpa_seq { , } {##1} % split one CSV row
          \tl_clear:N \l_tmpa_tl
          \seq_map_inline:Nn \l__diaa_csv_tmp_seq
            {
              \tl_put_right:Nx \l_tmpa_tl { { \seq_item:Nn \l_tmpa_seq {####1} } }
            }
          \prop_put:cxV { g__diaa_csv_#1_prop }
            { \seq_item:Nn \l_tmpa_seq {#2} }
            \l_tmpa_tl
        }
  }
% Step 2: getting the values
% star → global assignment, macro or tl var, value column, key, label
\NewDocumentCommand \getValue { s m O{1} m m }
  {
    \IfBooleanTF {#1} { \tl_gset:Nx } { \tl_set:Nx }
      #2 { \diaa_csv_item:nnn {#4} {#3} {#5} }
  }
% key, value column, label
\NewExpandableDocumentCommand \CSVItem { m O{1} m }
  { \diaa_csv_item:nnn {#1} {#2} {#3} }
\cs_generate_variant:Nn \tl_item:nn { f }
% key, value column, label
\cs_new:Npn \diaa_csv_item:nnn #1 #2 #3
  {
    \tl_item:fn { \prop_item:cn { g__diaa_csv_#3_prop } {#1} } {#2}
  }
\ExplSyntaxOff
\begin{document}
% Use default column for the key (1). The second empty optional argument (list
% of value columns) means we want to autodetect the value columns; then, the
% first column is for keys and all other columns are used as value columns.
\ReadCSV{mydata}{test.csv}
\getValue\rdPar{Second Parameter}{mydata}
\rdPar                         % 3
\getValue\rdPar[2]{Second Parameter}{mydata}
\rdPar                         % 6
\getValue\rdPar[3]{Second Parameter}{mydata}
\rdPar                         % 44
\getValue\rdPar{Third Parameter}{mydata}
\rdPar                         % 7
\edef\rdPar{\CSVItem{First Parameter}{mydata}}%
\rdPar                         % 5
\edef\rdPar{\CSVItem{First Parameter}[2]{mydata}}%
\rdPar                         % foo, bar
\edef\rdPar{\CSVItem{First Parameter}[3]{mydata}}%
\ifx\rdPar\empty
  \textlangle empty\textrangle
\else
  \rdPar
\fi
\end{document}

I am sorry for asking this naive question, but what does the line starting with \ReadCSV mean? I don't quite understand the meaning of {2,4} in it. — Diaa, Sep 15 '20 at 16:49
It reads the CSV file and stores one mapping (an l3prop) per “value column” you are interested in. The {2,4} argument means to only store values from columns 2 and 4. Column 1 is used for the keys (this is the default) and column 3 is discarded (if you want it not to be discarded, use {2,3,4} instead of {2,4}). — frougon, Sep 15 '20 at 16:54
This is just a little optimization: discarded columns occupy no memory. — frougon, Sep 15 '20 at 17:08
I suspected it at the first sight, but it was weird for me to find that discarding the third column didn't completely remove it from mydata when calling it by \getValue as I notice in other programming languages when dealing with matrices of data. — Diaa, Sep 15 '20 at 17:13
You obtain values from the discarded column using \getValue? With which invocation, precisely? — frougon, Sep 15 '20 at 17:16
I am writing from my phone, so I can't write very clearly. What I mean is that when you discard the third column, I expect the size of the data matrix stored in mydata to change from 3 x 4 to 3 x 3, which should (in my mind) turn the command into \getValue\rdPar[3]{Second Parameter}{mydata}, with 3 rather than 4 as the value column, when retrieving the value of the fourth column of the original data. Again, I haven't fully tested it on my PC, but the syntax confused me. — Diaa, Sep 15 '20 at 17:25
Column numbers when retrieving values aren't influenced by discarded columns. They are just used as is to name the various l3prop variables (one per non-discarded column). This could be done differently, of course, and we could even use only one l3prop with items as {first val}{second val}...{last val}. Different ways... — frougon, Sep 15 '20 at 17:30
If I want to consider all the columns, should I write all their indices? If yes, can it be the default behavior to include all the columns? — Diaa, Sep 15 '20 at 17:58
I wrote another version as explained just above. This changes column addressing when retrieving values: column 1 is the first non-discarded column, column 2 is the second non-discarded column, etc. What you asked can be done, but I need to know which version you want this to work with. I've added the second version in my answer so that you can see and test. — frougon, Sep 15 '20 at 18:16
I started with your second method. What I actually need is to change the syntax of \ReadCSV{<label>}{<columns to store>}{<csv file>} so that {<columns to store>} is optional rather than mandatory, with the default being to store all available columns of the original csv file. — Diaa, Sep 15 '20 at 19:04
For the column addressing, that's what I actually need to count the already-stored non-discarded columns without considering their original order in the csv file. — Diaa, Sep 15 '20 at 19:17
I think the update to “Second method” should do what you want. Don't forget to remove test.csv before testing it for the first time, because we do need the empty pair of braces (or anything else as long as it is not blank or commas) in the last column of the first row; otherwise, the autodetection routine won't see the last column. I'll be away from the computer now. — frougon, Sep 15 '20 at 19:32
First, I greatly appreciate your patience and attention to detail every time you answer my questions. Second, this file is driving me crazy. I created it with Excel, and when I use your code \getValue\rdPar[2]{Tepoly}{mydata}, \rdPar prints nothing. However, when I copy and paste its last, problematic line (Tepoly) into the filecontents environment, it works flawlessly. Can you suggest something to try? Or what could make \rdPar print nothing, as if I hadn't written it? — Diaa, Sep 15 '20 at 19:55
I've updated the “Second method” to use \seq_set_split:Nnn in order to split lines of the CSV files. This way, empty cells aren't ignored, even if they only appear as consecutive commas. You may prevent a comma from being interpreted as a column delimiter by wrapping the whole cell inside braces, but you don't need to use {} anymore in order to make sure an empty cell won't be ignored. — frougon, Sep 16 '20 at 00:51
For the second method, kindly consider my follow-up question here: https://tex.stackexchange.com/q/575030/2288. Thanks :) — Diaa, Dec 14 '20 at 20:20

Steven B. Segletes · Answer 2 · 2021-06-22T09:21:01.693

With the readarray package...

\begin{filecontents*}{test.csv}
Third Parameter  , 7 , 9  , 
First Parameter  , 5 , 10 ,
Second Parameter , 3 , 6  , 44
\end{filecontents*}
\documentclass{article}
\usepackage{readarray,pgffor}
\newcommand\getValue[3][2]{%
  \def\tmpA{#2}%
  \foreach\z in{1,...,\nrows}{%
    \arraytomacro#3[\z,1]\tmpB
    \ifx\tmpA\tmpB #3[\z,#1]\fi
  }%
}
\begin{document}
\readarraysepchar{,}
\readdef{test.csv}\mydata
\readarray\mydata\myarray[-,\ncols]
\nrows{} rows in data
Cell(2,2): \myarray[2,2]
Cell(3,4): \myarray[3,4]
\getValue{Third Parameter }{\myarray}
and
\getValue[4]{Second Parameter }{\myarray}
\end{document}

Note, the leading/trailing spaces in the input data are retained, by default. However, the use of \readarray* in lieu of \readarray will strip leading/trailing spaces (in which case the search field specifications must also have such spaces removed).

\begin{filecontents*}{test.csv}
Third Parameter  , 7 , 9  , 
First Parameter  , 5 , 10 ,
Second Parameter , 3 , 6  , 44
\end{filecontents*}
\documentclass{article}
\usepackage{readarray,pgffor}
\newcommand\getValue[3][2]{%
  \def\tmpA{#2}%
  \foreach\z in{1,...,\nrows}{%
    \arraytomacro#3[\z,1]\tmpB
    \ifx\tmpA\tmpB #3[\z,#1]\fi
  }%
}
\begin{document}
\readarraysepchar{,}
\readdef{test.csv}\mydata
\readarray*\mydata\myarray[-,\ncols]
\nrows{} rows in data
Cell(2,2): \myarray[2,2]
Cell(3,4): \myarray[3,4]
\getValue{Third Parameter}{\myarray}
and
\getValue[4]{Second Parameter}{\myarray}
\end{document}
