LaTeX3 - Advice on a code to validate an ISO date

Question

The following code is a POC showing a way to validate one Julian date in "pure" LaTeX3: it is related to this question.

There is no doubt that this code contains some clumsiness. Any advice would be welcome, except for error handling via messages, which is something I know how to do.

For example, are \l_mbc_date_year_str (created with \str_new:N) and \l_mbc_date_year_int both necessary?

\documentclass{article}
% Source for easy testing via pgffor:
%     + https://tex.stackexchange.com/a/696444/6880
\usepackage{pgffor}
\ExplSyntaxOn
\seq_new:N \g_mbc_month_size
\seq_set_from_clist:Nn \g_mbc_month_size {%
  0,  % Not used.
  31, % January
  0,  % February: this special value will help us to find bugs...
  31, % March
  30, % April
  31, % May
  30, % June
  31, % July
  31, % August
  30, % September
  31, % October
  30, % November
  31  % December
}
% The rule defining a leap year A is as follows:
%
%    + If A % 4 != 0, the year is not a leap year.
%
%    + If A % 4 = 0 , the year is a leap year unless 
%      A % 100 = 0 and A % 400 != 0.
%
% This leads to the following one-line validating test.
%
%     + (A % 4 = 0) AND (A % 100 != 0 OR A % 400 = 0)
\prg_set_conditional:Npnn \if_leap_year:N #1 { p , T , TF } {
  \bool_if:nTF {
    \int_compare_p:n { \int_mod:nn #1 { 4 } = 0 }
    && (
      \int_compare_p:n { \int_mod:nn #1 { 100 } != 0 }
      ||
      \int_compare_p:n { \int_mod:nn #1 { 400 } = 0 }
    )
  }{
    \prg_return_true:
  }{
    \prg_return_false:
  }
}
\regex_new:N \g_mbc_date_format_rgx
\regex_set:Nn\g_mbc_date_format_rgx { \A (\d+) - (\d+) - (\d+) \Z }
\tl_new:N \l_mbc_date_year_tl
\tl_new:N \l_mbc_date_month_tl
\tl_new:N \l_mbc_date_day_tl
\int_new:N \l_mbc_date_year_int
\int_new:N \l_mbc_date_month_int
\int_new:N \l_mbc_date_day_int
\NewDocumentCommand { \ValidateISODate }{ m }{
  \regex_extract_once:NnNTF \g_mbc_date_format_rgx { #1 } \l_tmpa_seq {
    % Integer values found.
    \seq_pop_right:NN \l_tmpa_seq \l_mbc_date_day_tl
    \seq_pop_right:NN \l_tmpa_seq \l_mbc_date_month_tl
    \seq_pop_right:NN \l_tmpa_seq \l_mbc_date_year_tl
    \int_set:Nn \l_mbc_date_day_int   \l_mbc_date_day_tl
    \int_set:Nn \l_mbc_date_month_int \l_mbc_date_month_tl
    \int_set:Nn \l_mbc_date_year_int  \l_mbc_date_year_tl
    % 1 <= month <= 12
    \int_compare:nTF { 1 <= \l_mbc_date_month_int <= 12 }{
% February special setting.
      \int_compare:nT { \l_mbc_date_month_int = 2 }{
        \if_leap_year:NTF \l_mbc_date_year_int {
          \seq_set_item:Nnn \g_mbc_month_size 2 { 29 }
        }{
          \seq_set_item:Nnn \g_mbc_month_size 2 { 28 }
        }
      }
% Good day.
      \int_compare:nTF {
        1 <= \l_mbc_date_day_int 
          <= \seq_item:Nn \g_mbc_month_size 
                          { \int_use:N \l_mbc_date_month_int }
      }{
        OK
% Bad day.
      }{
        KO (day)
      }
% NOT(1 <= month <= 12).
    }{
      KO (month)
    }
% Syntax error
  }{
    KO (syntax)
  }
}
\ExplSyntaxOff
\begin{document}
\section{OK}
\pgfkeys{
  tester/.code=\ValidateISODate{#1}{:} #1\par\medskip,
  tester/.list = {
    2023-06-14,
    2023-09-24,
    2023-02-28,
    2024-02-29,
    400-02-29
  }
}
\section{KO -- Invalid day}
\pgfkeys{
  tester/.code=\ValidateISODate{#1}{:} #1\par\medskip,
  tester/.list = {
    300-02-29,
    2023-02-29,
    2024-02-30,
    2023-09-00,
    2023-09-32
  }
}
\section{KO -- Invalid month}
\pgfkeys{
  tester/.code=\ValidateISODate{#1}{:} #1\par\medskip,
  tester/.list = {
    2023-19-32,
    2023-00-29
  }
}
\section{KO -- Syntax error}
\pgfkeys{
  tester/.code=\ValidateISODate{#1}{:} #1\par\medskip,
  tester/.list = {
    2023-06-XX,
    2023-09-19 2023-09-20,
    -0001-12-24
  }
}
\end{document}

Good point about str vs tl... As far as speed is concerned, time will tell whether optimisation is necessary or not. — projetmbc, Sep 23 '23 at 21:14
Noted. In my previous attempt, I used \int_compare:nTF, but I really think that the first concern is readability, and if optimization becomes a need, then you can "dirty" the code. — projetmbc, Sep 23 '23 at 21:27
For 30 and 31, I prefer a fixed uncalculated sequence, even if this is clearly not clever. At worst, I would tend to produce this type of constant via a Python script. — projetmbc, Sep 23 '23 at 21:41
Your variable \g_mbc_month_size is not named according to the l3 specification. A sequence variable name should always end in _seq. If it is internal, it should begin \l__ or \g__. Also, I doubt whether you need a global variable here. If not, it would be better local. The same goes for all other cases of internal variables/functions etc. They all should have __ at the start or immediately following the initial \if, \l, \c, \g etc. Plus your conditional should include mbc. \ValidateJulianDate should preferably include mbc. Also, these are NOT Julian dates. — cfr, Sep 24 '23 at 02:41
You should use \prg_new_conditional:Npnn to avoid overwriting an existing definition. You should only use set if you've previously used new. Note you can use \prg_generate_conditional_variant:Nnn to generate alternative argument specifications. This would be another way of dealing with the N/n confusion. If you generate a variant of n using V, you can pass an unbraced variable directly. — cfr, Sep 24 '23 at 03:00
@cfr So should there always be a N/V variant for each n just in case someone wants to use \int_year or \my_value_year instead of { \int_year }? Why shouldn't I force my user (or me in case of __) to always use { … } even if it's just one token? Will the N/V get expanded once and then forwarded to the n version if the variants are generated? — Qrrbrbirlbel, Sep 24 '23 at 03:24
@Qrrbrbirlbel No, you don't need to generate them or force bracing. The user or yourself generates any missing variants as they're needed. It works in the same way for your own functions as it does for kernel ones. For example, I have \cs_generate_variant:Nn \int_abs:n { v } but also \cs_generate_variant:Nn \__chronos_dateformat_sign:n { v }. It's not a question of shouldn't force your user to use braces. It's a question of can't. (Well, I suppose you could, but not without breaking a whole lot of stuff.) — cfr, Sep 24 '23 at 03:36
A V substitutes the value of the variable into a function specified with n. I'm not sure if forward is quite the right way to put it. You can't generate a variant which replaces n with N. N and n are base forms. All functions are defined with argument specifications which use only the base forms (N, n etc.). You can't define a function with V or o in the argument specification. Those are always generated variants. N -> c; n -> V, v, e, o etc. — cfr, Sep 24 '23 at 03:43
@Qrrbrbirlbel If you really want to force the user to brace the argument, you need to use one of the more primitive forms i.e. the equivalent of defining a delimited macro. But expl3 discourages that unless it's really unavoidable. — cfr, Sep 24 '23 at 03:47
Well, now that two people told you that your question is wrong (iso vs julian) my answer feels even more pointless. I explicitly decided that 400-2-003 is a valid input because it might not be ISO but it is still unambiguous and we can extract the integers for future usage. — Qrrbrbirlbel, Sep 24 '23 at 10:02
Maybe dealing only with ISO will be sufficient in concrete use. I am a L3 baby coder so my concern is more about coding. ;-) — projetmbc, Sep 24 '23 at 10:11

score 6 · Accepted Answer · answered Sep 24 '23 at 08:59

You're dealing with ISO formatted dates, not with Julian dates: a Julian day is just an integer, a Julian date is a decimal number.

ISO allows dates to be in the forms yyyy-mm-dd or yyyymmdd, but let's assume you only allow the hyphenated form.

Starting with a regex comparison seems a good strategy: the regular expression to match is

\A \d{4} \- \d{2} \- \d{2} \Z

that is, the input should consist of four digits, a hyphen, two digits, a hyphen, two digits. Any other input should result in the invalid ISO date message.

Here's my proposal, which you can compare with yours.

\documentclass{article}
\ExplSyntaxOn
\NewDocumentCommand{\validateISOdate}{m}
 {
  \projetmbc_isodate_validate:n { #1 }
 }
\cs_new_protected:Nn \projetmbc_isodate_validate:n
 {
  \regex_match:nnTF { \A \d{4} - \d{2} - \d{2} \Z } { #1 }
   {
    \__projetmbc_isodate_validate:n { #1 }
   }
   {
    Invalid~date~'#1'~(format)
   }
 }
\cs_new:Nn \__projetmbc_isodate_validate:n
 {
  \__projetmbc_isodate_process:w #1 \q_stop
 }
\cs_new:Npn \__projetmbc_isodate_process:w #1 - #2 - #3 \q_stop
 {
  \__projetmbc_isodate_check:nnn { #1 } { #2 } { #3 }
 }
\cs_new:Nn \__projetmbc_isodate_check:nnn
 {
  \bool_lazy_or:nnTF { \int_compare_p:nNn { #2 } < { 1 } } { \int_compare_p:nNn { #2 } > { 12 } }
   {
    Invalid~date~'#1-#2-#3'~(month)
   }
   {
    \__projetmbc_isodate_day:nnn { #1 } { #2 } { #3 }
   }
 }
\cs_new:Nn \__projetmbc_isodate_day:nnn
 {
  \int_compare:nNnTF { #3 } = { 0 }
   {
    Invalid~date~'#1-#2-#3'~(day)
   }
   {
    \__projetmbc_isodate_day_aux:nnn { #1 } { #2 } { #3 }
   }
 }
\cs_new:Nn \__projetmbc_isodate_day_aux:nnn
 {
  \int_compare:nNnTF { #3 } > { \__projetmbc_isodate_checkday:nn { #1 } { #2 } }
   {
    Invalid~date~'#1-#2-#3'~(day)
   }
   {
    Valid~date~'#1-#2-#3'
   }
 }
\cs_new:Nn \__projetmbc_isodate_checkday:nn
 {
  \int_case:nn { #2 }
   {
    {1}{31}
    {2}{\__projetmbc_isodate_february:n{#1}}
    {3}{31}
    {4}{30}
    {5}{31}
    {6}{30}
    {7}{31}
    {8}{31}
    {9}{30}
    {10}{31}
    {11}{30}
    {12}{31}
   }
 }
\cs_new:Nn \__projetmbc_isodate_february:n
 {
  \bool_lazy_and:nnTF
   {% year divisible by 4
    \int_compare_p:nNn { \int_mod:nn { #1 } { 4 } } = { 0 }
   } % AND
   {% year not divisible by 100 or divisible by 400
    \bool_lazy_or_p:nn
     {% not divisible by 100
      \int_compare_p:nNn { \int_mod:nn { #1 } { 100 } } > { 0 }
     }
     {% divisible by 400
      \int_compare_p:nNn { \int_mod:nn { #1 } { 400 } } = { 0 }
     }
   }
   {29} % year is leap
   {28} % year is not leap
 }
\ExplSyntaxOff
\begin{document}
\ExplSyntaxOn
\NewDocumentCommand{\test}{m}
 {
  \clist_map_inline:nn { #1 } { \projetmbc_isodate_validate:n { ##1 } \par }
 }
\ExplSyntaxOff
\test{
    2023-06-14,
    2023-09-24,
    2023-02-28,
    2024-02-29,
    400-02-29,
    300-02-29,
    2023-02-29,
    2024-02-30,
    2023-09-00,
    2023-09-32,
    2023-19-32,
    2023-00-29,
    2023-06-XX,
    2023-09-19,
    2023-09-20,
    -0001-12-24,
    2000-02-29,
    2100-02-29
}
\end{document}

The missing tool for me was \bool_lazy_or_p (@Qrrbrbirlbel has used it in my previous post). I like the code, but I have a question. Why don't you catch the values with the regex? Is it for performance reasons that you want to eat the hyphenated list of tokens? Can I steal your creation? :-) — projetmbc, Sep 24 '23 at 09:07
@projetmbc When the input is in a very specific format (checked with the regex) capturing the items is easier and faster (and expandable, by the way) with delimited arguments. — egreg, Sep 24 '23 at 09:11
I plan to write easy-to-understand tutorials about LaTeX3 that I find very easy to use, but I still need to explore a lot more of the API... — projetmbc, Sep 24 '23 at 09:17

Qrrbrbirlbel · Answer 2 · 2023-09-24T20:37:25.203

Let me summarize my findings and a few opinions:

Section 19.3 starts with “the actions at the left of the sequence are faster than those acting on the right“, so \seq_pop_right:NN seems to be a bad start. I'd swap that out for \seq_pop_left:NN.
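
A minimal sketch of that swap, assuming the capturing regex from the question. The names \ShowDateParts, \l__mbc_match_seq and the \l__mbc_..._tl variables are hypothetical and only serve this illustration. Since \regex_extract_once stores the whole match as the first item, popping from the left needs one extra pop before the year:

\documentclass{article}
\ExplSyntaxOn
\seq_new:N \l__mbc_match_seq
\tl_new:N \l__mbc_whole_tl
\tl_new:N \l__mbc_year_tl
\tl_new:N \l__mbc_month_tl
\tl_new:N \l__mbc_day_tl
\NewDocumentCommand { \ShowDateParts } { m }
 {
  \regex_extract_once:nnNTF { \A (\d+) - (\d+) - (\d+) \Z } { #1 } \l__mbc_match_seq
   {
    % Item 1 is the whole match, then the three groups follow in order.
    \seq_pop_left:NN \l__mbc_match_seq \l__mbc_whole_tl
    \seq_pop_left:NN \l__mbc_match_seq \l__mbc_year_tl
    \seq_pop_left:NN \l__mbc_match_seq \l__mbc_month_tl
    \seq_pop_left:NN \l__mbc_match_seq \l__mbc_day_tl
    year = \l__mbc_year_tl ,~ month = \l__mbc_month_tl ,~ day = \l__mbc_day_tl
   }
   { no~match }
 }
\ExplSyntaxOff
\begin{document}
\ShowDateParts{2023-06-14}
\end{document}

Whether the left/right difference is measurable for a four-item sequence is a separate question; the point here is only that the order of the pops changes.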

In my answer to your linked question, I've used the expandable \seq_item:Nn inside of an int_compare statement. Unfortunately, the manual does not mention the speed or efficiency of this seemingly direct access.

At least then we don't have to use extra macros (“token lists”) for these things.
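
As a sketch of what that looks like (the command name \CheckMonthRange and the sequence \l__mbc_match_seq are hypothetical), the month can be read straight out of the extraction result inside the integer expression, without any intermediate token list or int variable:

\documentclass{article}
\ExplSyntaxOn
\seq_new:N \l__mbc_match_seq
\NewDocumentCommand { \CheckMonthRange } { m }
 {
  \regex_extract_once:nnNTF { \A (\d+) - (\d+) - (\d+) \Z } { #1 } \l__mbc_match_seq
   {
    % Item 1 is the whole match; items 2, 3, 4 are year, month, day.
    \int_compare:nTF { 1 <= \seq_item:Nn \l__mbc_match_seq { 3 } <= 12 }
     { month~OK }
     { month~out~of~range }
   }
   { no~match }
 }
\ExplSyntaxOff
\begin{document}
\CheckMonthRange{2023-06-14}\par
\CheckMonthRange{2023-19-32}
\end{document}

This works because \seq_item:Nn is expandable, so the integer expression sees the digits of the stored item.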

If we allow delimited macros in a LaTeX3 approach I'd also revert back to my first version of my other answer and use those. For me, it is much more natural to use #n than deal with sequences.
That said, the manual also says that the second N from \seq_pop_…:NN is a token list but you're using a “string” variable (sure, both are just macros). The n of \int_set:Nn is an “int expr” with no further specification (but not an N).

In the first solution (delimited macro) I don't need to access any sequence and in the second solution (heavy regex) I'm going to use \seq_item:Nn again. Unless you need to use those values again later I don't see the need to store them in some macro/token list/count.
The int variable is a real TeX count and \int_set:Nn basically does #1=\numexpr#2\relax.

So the question becomes: is comparing numbers faster when one of them is a count and does it matter if you only use them once or twice per date? It already needs to be \numexpred once anyway to store the count in the first place.
More important, I think, is the efficiency note in section 20.5: \int_compare(_p):nNn(TF) is five times faster than \int_compare(_p):n(TF). I'm assuming that is because it only uses the TeX basics =, < and > and doesn't have to parse and convert !=, >= and <=.
Parsing of &&, || and () ought to be slower than directly accessing the functions And or Or. But that will be a compromise between speed, natural readability and logic.
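
To make the trade-off concrete, here is a sketch of the same month-range test written both ways. The function names \mbc_month_check_readable:n and \mbc_month_check_fast:n are hypothetical and exist only for this comparison:

\documentclass{article}
\ExplSyntaxOn
% Readable form: one chained comparison, parsed by \int_compare:nTF.
\cs_new:Npn \mbc_month_check_readable:n #1
 {
  \int_compare:nTF { 1 <= #1 <= 12 } { OK } { KO }
 }
% Faster form: two nNn comparisons joined by a lazy And.
\cs_new:Npn \mbc_month_check_fast:n #1
 {
  \bool_lazy_and:nnTF
   { \int_compare_p:nNn { #1 } > { 0 } }
   { \int_compare_p:nNn { #1 } < { 13 } }
   { OK } { KO }
 }
\ExplSyntaxOff
\begin{document}
\ExplSyntaxOn
\mbc_month_check_readable:n { 12 } ~ \mbc_month_check_fast:n { 12 } \par
\mbc_month_check_readable:n { 13 } ~ \mbc_month_check_fast:n { 13 }
\ExplSyntaxOff
\end{document}

Both versions should agree on every input; since nNn only offers =, < and >, the inclusive bounds 1 and 12 become the strict bounds 0 and 13.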

In my code I get away with simple binary Ors – lazy ones at that which && and || are not (section 9.3).
If \if_int_compare:w is “allowed” in a pure LaTeX3 solution I would use \if_case:w for getting the last day of the month instead of a look-up in a sequence.

Using 0 as the last day whenever the month number is outside the range 1–12 already takes care of our month check.
You're mixing your Ns and ns.

N stands for a single token and n stands for a “set of token given in braces” (section 1.1) but the manual also states “if you use a single token for an n argument, all will be well”.

However, I would define \if_leap_year:n and brace the #1 after \int_mod:nn, which would then not break \if_leap_year:nTF { 2023 }.
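
A sketch of such an n-type conditional, combined with the V variant suggested in the comments so that a variable can still be passed unbraced. The names \mbc_if_leap_year:n and \l__mbc_year_int are assumptions made for this example:

\documentclass{article}
\ExplSyntaxOn
\prg_new_conditional:Npnn \mbc_if_leap_year:n #1 { p , T , F , TF }
 {
  \bool_lazy_and:nnTF
   { \int_compare_p:nNn { \int_mod:nn { #1 } { 4 } } = { 0 } }
   {
    \bool_lazy_or_p:nn
     { \int_compare_p:nNn { \int_mod:nn { #1 } { 100 } } > { 0 } }
     { \int_compare_p:nNn { \int_mod:nn { #1 } { 400 } } = { 0 } }
   }
   { \prg_return_true: }
   { \prg_return_false: }
 }
% V variant: hands the value of an int (or tl) variable to the n-type base function.
\prg_generate_conditional_variant:Nnn \mbc_if_leap_year:n { V } { p , T , F , TF }
\int_new:N \l__mbc_year_int
\ExplSyntaxOff
\begin{document}
\ExplSyntaxOn
% Braced literal year.
\mbc_if_leap_year:nTF { 2023 } { leap } { not~leap } \par
% Year stored in a variable, passed through the V variant.
\int_set:Nn \l__mbc_year_int { 2024 }
\mbc_if_leap_year:VTF \l__mbc_year_int { leap } { not~leap }
\ExplSyntaxOff
\end{document}

With every #1 braced, the base function accepts a literal year, an integer expression, or (through the variant) a variable.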

The hard-coded 2 for the \seq_set_item:Nnn should be okay. (By the way, you set the second item of your 30/31 list with the first February and then it will always be that value and the special value will be gone forever.)
I've also incorporated some of the things @cfr mentioned in the comments, though whether a leap year tester must be an internal macro can be debated.

We can also create a regular expression that also catches illegal month and day numbers:

\regex_new:N \g__mbc_datex_format_rgx
\regex_set:Nn\g__mbc_datex_format_rgx { \A (\d+) \-
  (0* (?:[1-9]|1[012])) \- (0* (?:[1-9]|[12][0-9]|3[01])) \Z }

The only test to be done then is: last day of month in that year. Again, in this example I'm using the whole l3regex system by extracting the values with it as well.

Testing with l3benchmark seems to reveal that this takes roughly twice as long as the delimited macro approach with only the primitive \if_case:w check, unsurprisingly.

Code

\documentclass{article}
\usepackage{pgffor, l3benchmark}
\ExplSyntaxOn
\cs_new:Nn \__mbc_lastday_of_month_in_year:nn {
  \if_case:w #1~
    0 \or: 31 \or: \__mbc_if_leap_year:nTF {#2}{29}{28} \or: 31
      \or: 30 \or: 31 \or: 30
      \or: 31 \or: 31 \or: 30
      \or: 31 \or: 30 \or: 31
      \else: 0 \fi:
}
\prg_new_conditional:Npnn \__mbc_if_leap_year:n #1 { TF } {
  \int_compare:nNnTF     {\int_mod:nn {#1} {   4 } } = { 0 } {
    \int_compare:nNnTF   {\int_mod:nn {#1} { 100 } } = { 0 } {
      \int_compare:nNnTF {\int_mod:nn {#1} { 400 } } = { 0 } {
        \prg_return_true:     % multiple of 400 →    leap year
      }{ \prg_return_false: } % multiple of 100 → no leap year
    }{ \prg_return_true: }    % multiple of   4 →    leap year
  }{ \prg_return_false: }  % no multiple of   4 → no leap year
}
\regex_new:N \g__mbc_date_format_rgx
\regex_set:Nn\g__mbc_date_format_rgx { \A \d+ - \d+ - \d+ \Z }
\prg_new_conditional:Npnn \__mbc_if_validate_date:n #1 { TF }{
  \regex_match:NnTF \g__mbc_date_format_rgx { #1 } {
    \__mbc_datetest_parse:w #1 \q_stop
  }{ \prg_return_false: }
}
\cs_new:Npn \__mbc_datetest_parse:w #1 - #2 - #3 \q_stop {
  \bool_lazy_or:nnTF % case 0 and case else fail this for invalid months
    { \int_compare_p:nNn { #3 } < { 1 } }
    { \int_compare_p:nNn { #3 } >
      { \__mbc_lastday_of_month_in_year:nn { #2 } { #1 } } }
  { \prg_return_false: }
  { \prg_return_true:  }
}
%%% heavy regex solution
\regex_new:N \g__mbc_datex_format_rgx
\regex_set:Nn\g__mbc_datex_format_rgx { \A (\d+) -
  (0* (?:[1-9]|1[012])) - (0* (?:[1-9]|[12][0-9]|3[01])) \Z }
\prg_new_conditional:Npnn \__mbc_if_validate_datex:n #1 { TF }{
  \regex_extract_once:NnNTF \g__mbc_datex_format_rgx { #1 } \l_tmpa_seq {
    \int_compare:nNnTF { \seq_item:Nn \l_tmpa_seq {4} }
                       >
                       { \__mbc_lastday_of_month_in_year:nn
                         { \seq_item:Nn\l_tmpa_seq {3} }
                         { \seq_item:Nn\l_tmpa_seq {2} } }
    { \prg_return_false: }
    { \prg_return_true:  }
  }{ \prg_return_false: }
}
\NewDocumentCommand { \ValidateJulianDate }{ m m m }{
  \__mbc_if_validate_date:nTF { #1 }{ #2 }{ #3 }
}
\NewDocumentCommand { \ValidateJulianDateX }{ m m m }{
  \__mbc_if_validate_datex:nTF { #1 }{ #2 }{ #3 }
}
\ExplSyntaxOff
\setlength\parindent{0pt}
\pgfkeys{
  tester/.code = \ValidateJulianDate {#1}{OK}{Not OK} and
                 \ValidateJulianDateX{#1}{OK}{Not OK}: #1\par\medskip}
\newcommand*\test[1]{\pgfkeys{tester/.list={#1}}}
\begin{document}
\section{OK}
\test{2023-06-14, 2023-09-24, 2023-02-28, 2024-02-29, 400-02-29}
\section{KO -- Invalid day}
\test{300-02-29, 2023-02-29, 2024-02-30, 2023-09-00, 2023-09-32}
\section{KO -- Invalid month}
\test{2023-19-32, 2023-00-29}
\section{KO -- Syntax error}
\test{2023-06-XX, 2023-09-19 2023-09-20, -0001-12-24}
\clearpage
\section{Table}
\ExplSyntaxOn\ttfamily
\foreach \v in {\ValidateJulianDate, \ValidateJulianDateX}{
  \expandafter \string \v : \par
  \benchmark_once:n {
    \foreach \m in {0, ..., 13}{
      \ifnum\int_mod:nn{\m-1}{3}=0\medskip\fi
      \foreach \d in {0, ..., 32}{
        \ifnum\int_mod:nn{\d+1}{10}=0\relax\space\fi
        \exp_args:Nx\v{2023-0\m-0\d}{1}{0}
      }
      \par
    }
  }
  \bigskip
}
\ExplSyntaxOff
\end{document}


About the internal bit: I only meant if something is meant to be internal .... There is a rather substantive explanation of 'integer expression' at the start of the documentation of that module. In the version I have it's section 20.1 pp.161-162, but the documentation of \int_eval:n etc. might help, too. It specifies what counts as n type here, which is slightly non-standard. — cfr, Sep 24 '23 at 05:43
Good point about the capturing regex, but I will investigate whether it is slower than using a basic one. For the leap year, I think it is useful to have a dedicated predicate. I will in — projetmbc, Sep 24 '23 at 08:49
Thanks for your proposition. I will do a mix of your design choice and some of mine. — projetmbc, Sep 24 '23 at 08:56
One question: I have binary outputs for the benchmarks. Why? — projetmbc, Sep 24 '23 at 08:57
I keep your code in my folders. I plan to write easy-to-understand tutorials about LaTeX3 that I find very easy to use, but I still need to explore a lot more of the API... — projetmbc, Sep 24 '23 at 09:17
@projetmbc Well, if using a delimited macro is still allowed under LaTeX3 I'd prefer that too. Then we don't even have to deal with accessing the items in the sequence. It's just annoying we can't do this inline and need to define yet another macro. Please do write an easy-to-understand tutorial. Understanding the manual without examples isn't easy without sitting down and reading it from front to back. I rarely delve into L3 because the start of the learning curve is so steep. — Qrrbrbirlbel, Sep 24 '23 at 10:14
By the way, my first version of the previous uses a delimited macro to split the values. I thought using l3regex to catch these elements would be “neater”. There's also xparse with which we can preprocess arguments and split them at the - … — Qrrbrbirlbel, Sep 24 '23 at 10:38
@Qrrbrbirlbel xparse is deprecated. For recent kernels, you no longer need it. — cfr, Sep 24 '23 at 20:05
@cfr Yes, I meant the functionalities that were previously available by the xparse package. There should still be > \SplitArgument { 2 } { - } m which already gives us the values split at - and -NoValue- markers if the argument doesn't have enough hyphens. But checking for numbers is still needed then. — Qrrbrbirlbel, Sep 24 '23 at 20:09
@projetmbc I've updated my answer. I've reverted back to my previous answer using delimited macros, I like them much more but I will leave my statements about the datatypes and the N/n discussion since the other answer doesn't go into this at all. The binary output is on purpose to make it easier to notice differences at a glance (and obvious mistakes). — Qrrbrbirlbel, Sep 24 '23 at 20:40
