Create a sorted, three-column table from a single column in a CSV file

Question

I have a CSV file that has various columns. One of the columns is called summits. This column contains multiple summits in the following form:

Summit name (elevation m)

For example

Mount Everest (8848m)

What I'm trying to do is to to create a table that lists all unique summits in descending order. For example like this:

I was already able to create a list with unique entries using the datatool package.

\DTLforeach*{tours}{\Summit=summits}{%
  \expandafter\DTLifinlist\expandafter{\Summit}{\uniquesummit}%
  {}
  {%
    \ifdefempty{\uniquesummit}%
    {\let\uniquesummit\Summit}% 
    {% 
      \eappto\uniquesummit{,\Summit}%
    }%
  }%
}

What is left is:

Splitting the name of the summit and the elevation
Sorting it
Creating a table

I thought I figured out creating the table with this answer, but in my case, all the summits are just in the first row and not split around multiple rows. I guess the reason is, that my list is in a slightly different form? But why?

Here is a MWE that should better describe my problem. Please note that in the actual CSV file, the ; is a tab. I couldn't get it working with a tab in the MWE:

\documentclass{article}
\usepackage{filecontents}
\usepackage{datatool}
\begin{filecontents}{summits.csv}
title;summits;description
test1;Matterhorn (4500m);description1
test2;Mount Everest (8848m);description2
test3;K2 (8611m), Kangchenjunga (8586m);description3
test4;Mount Everest (8848m);description4
\end{filecontents}
\begin{document}
    \DTLsetseparator{;}%
    \DTLloaddb{tours}{summits.csv}
\newcommand*{\uniquesummit}{}

\DTLforeach*{tours}{\Summit=summits}{%
    \expandafter\DTLifinlist\expandafter{\Summit}{\uniquesummit}%
    {}{
        \ifdefempty{\uniquesummit}%
        {\let\uniquesummit\Summit}%
    {%
        \eappto\uniquesummit{,\Summit}%
        }%
    }%
}


% Shows that I can find the unique summits
List of unique summits: \uniquesummit.
\bigskip
% This I would like to generate based on the summits.csv
\begin{tabular}{l l l}
1 &Mount Everest & 8848\
2 & K2 & 8611\
3 & Kangchenjunga & 8586\
4 & Matterhorn & 4500\
\end{tabular}
\end{document}

Can you please show a small test CSV-file of a few lines in the correct (your) format? — Skillmon, Jan 16 '21 at 18:55
Tab is a bad idea, as TeX (usually) treats it as a space. Better use ; or , as the column delimiter. — Skillmon, Jan 16 '21 at 19:25
Are the fields title and description important or could they be dropped? — Skillmon, Jan 16 '21 at 19:32
@Skillmon Yes it is, but unfortunately, I can't change that. Title and description should be dropped. — SCM, Jan 16 '21 at 19:52

Skillmon · Accepted Answer · 2021-01-16T21:29:33.143

The following is using expl3 for the data processing. It works in the following way:

First read in the first line of the file and parse the header. Find the column which is called summits, so that in further processing we know which column is the interesting one.

Next read the file line by line and extract the $n$-th column (the one named summits). Split that column on commas (to get all the summits in that line), and then extract the name and height of each summit via a regular expression. Put the found values into a single seq variable.

Afterwards we remove duplicates from this seq, and sort for the height of the mountains (in descending order).

Then output the table. Each seq element in the now sorted table is given to \readmountains_output_single:n which formats every line.

This is all done by the macro \readmountains, which takes an optional and a mandatory argument. The mandatory argument is the file name, the optional one is the column delimiter (defaulting to a Tab, which is ^^I if you want to input it directly from within TeX, but the optional argument wants the delimiter escaped, so you'd provide \^^I for a Tab, which is the default).

(there is a bit of error checking going on, so that

\documentclass[]{article}
\begin{filecontents}{\jobname.csv}
title;summits;description
test1;Matterhorn (4500m);description1
test2;Mount Everest (8848m);description2
test3;K2 (8611m), Kangchenjunga (8586m);description3
test4;Mount Everest (8848m);description4
\end{filecontents}
\ExplSyntaxOn
\ior_new:N \g_readmountains_file_ior
\seq_new:N \l_readmountains_mountains_seq
\seq_new:N \l_readmountains_extract_seq
\int_new:N \g_readmountains_output_int
\int_new:N \l_readmountains_col_int
\int_new:N \l_readmountains_catcode_int
\int_new:N \l_readmountains_line_int
\seq_new:N \l_readmountains_csv_seq
\scan_new:N \s_readmountains_end
\msg_new:nnn { readmountains } { no-summits }
  { Column ~ summits' ~ not ~ found. ~ Aborting. } \msg_new:nnn { readmountains } { short-line } { Short ~ line ~ encountered ~ in ~ line ~ \int_use:N \l_readmountains_line_int. ~ Aborting. } \msg_new:nnn { readmountains } { malformatted-summit } { Malformatted ~ summit ~ encountered ~ in ~ line ~ \int_use:N \l_readmountains_line_int. ~ Aborting. } \NewDocumentCommand \readmountains { O{\^^I} m } { \readmountains:Nn #1 {#2} } \cs_new_protected:Npn \readmountains:Nn #1#2 { \readmountains_read_file:NnN #1 {#2} \l_readmountains_mountains_seq \seq_remove_duplicates:N \l_readmountains_mountains_seq \readmountains_sort:N \l_readmountains_mountains_seq \readmountains_output:N \l_readmountains_mountains_seq \use_none:n \s_readmountains_end } \cs_new_protected:Npn \readmountains_read_file:NnN #1#2#3 { \int_set:Nn \l_readmountains_catcode_int { \char_value_catcode:n {#1 } }
    \char_set_catcode_other:N #1
    \ior_open:Nn \g_readmountains_file_ior {#2}
    \seq_clear:N #3
    \ior_get:NN \g_readmountains_file_ior \l_tmpa_tl
    \exp_args:NNx \seq_set_split:NnV \l_readmountains_csv_seq
      { \char_generate:nn { #1 } { 12 } } \l_tmpa_tl \int_zero:N \l_readmountains_col_int \seq_map_inline:Nn \l_readmountains_csv_seq { \int_incr:N \l_readmountains_col_int \str_if_eq:nnT { summits } {##1} { \seq_map_break: } } \str_if_eq:eeF { \seq_item:Nn \l_readmountains_csv_seq \l_readmountains_col_int } { summits } { \msg_error:nn { readmountains } { no-summits } \readmountains_use_none_delimit_by_s_end:w } \int_set:Nn \l_readmountains_line_int \c_one_int \ior_map_inline:Nn \g_readmountains_file_ior { \int_incr:N \l_readmountains_line_int \exp_args:NNx \seq_set_split:Nnn \l_readmountains_csv_seq { \char_generate:nn {#1 } { 12 } }
          {##1}
        \int_compare:nNnT
          \l_readmountains_col_int > { \seq_count:N \l_readmountains_csv_seq }
          {
            \msg_error:nn { readmountains } { short-line }
            \ior_map_break:n { \readmountains_use_none_delimit_by_s_end:w }
          }
        \exp_args:Nnnx
        \seq_set_split:Nnn \l_readmountains_csv_seq {,}
          { \seq_item:Nn \l_readmountains_csv_seq \l_readmountains_col_int }
        \seq_map_inline:Nn \l_readmountains_csv_seq
          {
            \tl_if_empty:nF { ####1 }
              {
                \regex_extract_once:nnNF { (.*)\s((\d+)m) } { ####1 }
                  \l_readmountains_extract_seq
                  {
                    \msg_error:nn { readmountains } { malformatted-summit }
                    \readmountains_use_none_delimit_by_s_end:w
                  }
                \seq_push:Nx #3
                  {
                    { \seq_item:Nn \l_readmountains_extract_seq {2} }
                    { \seq_item:Nn \l_readmountains_extract_seq {3} }
                  }
                \use_none:n \s_readmountains_end
              }
          }
      }
    \ior_close:N \g_readmountains_file_ior
    \char_set_catcode:nn { `#1 } \l_readmountains_catcode_int
  }
\cs_new_protected:Npn \readmountains_sort:N #1
  {
    \seq_sort:Nn #1
      {
        \int_compare:nNnTF { \use_ii:nn ##1 } < { \use_ii:nn ##2 }
          \sort_return_swapped:
          \sort_return_same:
      }
  }
\cs_new_protected:Npn \readmountains_output:N #1
  {
    \int_gzero:N \g_readmountains_output_int
    \begin { tabular } { r l r }
      \seq_map_function:NN #1 \readmountains_output_single:n
    \end { tabular }
  }
\cs_new_protected:Npn \readmountains_output_single:n #1
  {
    \int_gincr:N \g_readmountains_output_int
    \int_use:N \g_readmountains_output_int .
    & \use_i:nn #1 
    & \use_ii:nn #1
    \
  }
\cs_new:Npn \readmountains_use_none_delimit_by_s_end:w #1 \s_readmountains_end
  {}
\ExplSyntaxOff
\begin{document}
% since the file was written with ; as the delimiter use that
\readmountains[;]{\jobname.csv}
\end{document}

Wow that's awesome. I just tried it an it works perfectly! Also thanks a lot for the explanation. That is very helpful.
I hope with this explanation I can figure out the last bit that I forget to put into the MWE. Sometimes, the summits row can also be empty. At the moment this code then fails, because it does not match the regex in the readmountains_read_file. — SCM, Jan 16 '21 at 21:18
@SCM I've changed the code such that it ignores empty summits (while still throwing an error on malformatted summits). Also, now after throwing the error for malformatted summits the macro only ignores this one entry. (I didn't test these changes, but only done them directly in this answer, they should work but poke me if they don't) — Skillmon, Jan 16 '21 at 21:30
@SCM also, feel free to leave a comment if you need more explanations (or want me to add comments to the code with more explanations), if time permits I'd add them. — Skillmon, Jan 16 '21 at 21:32
@SCM if you want to add a table headrow, make these changes in \readmountains_output:N (keep in mind that a space must be added as ~ in between of \ExplSyntaxOn ...\ExplSyntaxOff). To change the formatting of individual rows change the definition of \readmountains_output_single:n (use \use_i:nn #1 to access the name, \use_ii:nn for the height). — Skillmon, Jan 16 '21 at 21:42
Thanks! Your code was so readable that I was already able doing that on my own. Thansk! — SCM, Jan 16 '21 at 22:21

Create a sorted, three-column table from a single column in a CSV file

1 Answers1