
In LaTeX one uses metadata frequently, the most common entries being:

  • \title
  • \author
  • \date
  • \address

These are mostly used without any validity check.

Is there a package that would allow one to define other entries of metadata and use the values in documents and possibly define tests and check the values of the entries?

It seems that such a package would be extremely useful in expanding the use of metadata: it would allow hierarchies of data to be defined and, above all, it would allow tests.

  • The problem is that "validity" is quite vague and depends on the use case. What should e.g. date test? Is "Christmas 2023" a valid date? – Ulrike Fischer Jan 07 '24 at 13:56
  • Of course, and one could define some standard validity checks for standard metadata entries and leave users to define their own tests using their own means (Lua, LaTeX, ...), or even calls to external tests for standard things outside TeX like a zip code, latitude, time zone, etc. – TeX Apprentice Jan 07 '24 at 14:02
  • Well, if arbitrary words like christmas are allowed, then everything is allowed and a validity test is trivial. – Ulrike Fischer Jan 07 '24 at 14:12
  • The definition of an entry would require at a minimum a 'format', so a user or the author of a package using metadata should be free to decide whether 'christmas 2045' is an accepted format or not. The user can then develop their own test for validity, and I do not see why it would be trivial. – TeX Apprentice Jan 07 '24 at 14:18
  • The definition of an entry like date could include multiple entry formats: xx/yy/zzzz or yy/zzzz ... The rendering would depend on language and locale, and testing the validity of 29/02/2024 is already a non-trivial programming task. – TeX Apprentice Jan 07 '24 at 14:43
  • Sure, but what I mean is that you need to come up with a specification first. Currently you only have quite vague ideas both about what the standard tests are and about which extension tests users will perhaps want to add. And date is actually rather easy as there are at least some (ISO) standards, but address or author can get quite interesting. – Ulrike Fischer Jan 07 '24 at 15:00
  • What I envisage is really a meta-package, that is, a package to be used by other packages or classes. It would only facilitate the definition, use and testing of entries, which would be left entirely to the author, with some examples to guide use. – TeX Apprentice Jan 07 '24 at 15:12
  • Then start and write the documentation and the specification. Once that is there it should be easy to write the code. – Ulrike Fischer Jan 07 '24 at 15:14
  • The above comment by @UlrikeFischer is worth knowing. I write my own code that way (specs, docs, code, then revise specs and docs). Note that PDF uses different encodings for metadata, depending on whether the data is in the old info dictionary or the newer XMP. The date formats differ as well, and so do the active characters. The existing hyperref knows how to handle most of this (for existing metadata). If you or anyone else wants to create the desired package, you might peek into hyperref. If your metadata needs only XMP, that might be easiest. – rallg Jan 07 '24 at 16:58
  • @rallg Can you please explain to me what metadata are, and what the difference between data and metadata is with respect to the OP's example? You can easily parse JSON, for example, with Lua, and you can store whatever you need, validate it, etc. Lua is a data language and you can use it for handling data of any form. But as Ulrike mentioned, step 1 is to develop a spec. – yannisl Jan 07 '24 at 17:08
  • For me this will be a first, and I'll give the specs and docs a try. If any of you have recommendations for a good set of specs, @UlrikeFischer, please let me know. – TeX Apprentice Jan 07 '24 at 18:21
  • @yannisl I use "metadata" with a specific technical meaning for PDF. It is information stored in the PDF, but not printed. It can be read by PDF reader software and printers, and by digital database records, although some software reads very little of it and other software reads more of it. Title, author, creation date, software that produced the PDF, and UUID (if used) are examples of metadata. In some cases, instructions for printing are metadata. If keywords are used, they are metadata. The word "data" is more generic. – rallg Jan 07 '24 at 21:45
  • @TeXApprentice You can see the pdf specification for example: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf – yannisl Jan 08 '24 at 01:37
  • @rallg Agreed, so one of the first things the OP should write in the spec, the definition of metadata=Embed in the PDF a custom info|dictionary to be read by computers and not humans! – yannisl Jan 08 '24 at 02:34
  • @yannisl I am under the impression that nowadays the use of XMP (a form of XML) is considered better than the info dictionary. However, my depth of knowledge is insufficient to be sure. – rallg Jan 08 '24 at 03:27
  • @rallg Yes, it is better, and there is already a package for that on CTAN: hyperxmp – yannisl Jan 08 '24 at 04:29

3 Answers


As mentioned in the comments the question is rather underspecified. However, it is relatively easy to write a "test suite" of sorts for the metadata. The commands \title, \author etc. write their arguments into the internal macros \@title, \@author etc. You can inspect the values of these internal macros for example with regular expressions. While this is by no means the best or easiest way to perform such checks, it can serve as a proof of concept.

Note however that in the general case most input should be permissible in these fields. For the toy example below I have defined some rules, but for each of them it is rather easy to imagine a situation where breaking the rule would be reasonable. I think that is not a limitation of my toy rules as such: for any rule that you could come up with there are exceptions.

Now for the code. First it defines \address analogous with \title (i.e., storing the argument in \@address), because this command is not defined by default in the article class.

Then it sets some messages to display during the checks. Afterwards the internal macros are copied into token lists and used as input for the regular expressions.

The rest of the code is general administrative stuff: for some reason \regex_match does not accept a token list variable but only a token list directly, so a variant needs to be created. I attempted a separation between package code and user commands and put everything in a .sty file.

File metadatacheck.sty:

\ProvidesPackage{metadatacheck}[2024/01/07 v0.1 Regex Check on Metadata]
\newcommand{\address}[1]{\gdef\@address{#1}}

\ExplSyntaxOn
% variant of \regex_match:nnTF that accepts a token list variable
\prg_generate_conditional_variant:Nnn \regex_match:nn {nV} {T,F,TF}

% messages shown when a check passes
\msg_new:nnnn{metadatacheck}{titleok}{Title~check~OK}{}
\msg_new:nnnn{metadatacheck}{authorok}{Author~check~OK}{}
\msg_new:nnnn{metadatacheck}{dateok}{Date~check~OK}{}
\msg_new:nnnn{metadatacheck}{addressok}{Address~check~OK}{}

% messages shown when a check fails
\msg_new:nnnn{metadatacheck}{titleerr}{Title~does~not~start~with~upper~case}{}
\msg_new:nnnn{metadatacheck}{authorerr}{Author~cannot~have~numbers}{}
\msg_new:nnnn{metadatacheck}{dateerr}{Date~does~not~include~4-digit~year}{}
\msg_new:nnnn{metadatacheck}{addresserr}{Address~too~short~(5~characters~minimum)}{}

\tl_new:N \l_title_tl
\tl_new:N \l_author_tl
\tl_new:N \l_date_tl
\tl_new:N \l_address_tl

\cs_new_protected:Nn \check_meta_data: {
  % copy the expanded internal macros into token list variables
  \tl_set:Nx \l_title_tl {\@title}
  \tl_set:Nx \l_author_tl {\@author}
  \tl_set:Nx \l_date_tl {\@date}
  \tl_set:Nx \l_address_tl {\@address}

  \regex_match:nVTF{\A[A-Z]}{\l_title_tl}
    {\msg_note:nn{metadatacheck}{titleok}}
    {\msg_warning:nn{metadatacheck}{titleerr}}
  \regex_match:nVTF{\A[^0-9]+\Z}{\l_author_tl}
    {\msg_note:nn{metadatacheck}{authorok}}
    {\msg_warning:nn{metadatacheck}{authorerr}}
  \regex_match:nVTF{\d{4}}{\l_date_tl}
    {\msg_note:nn{metadatacheck}{dateok}}
    {\msg_warning:nn{metadatacheck}{dateerr}}
  \regex_match:nVTF{.{5}}{\l_address_tl}
    {\msg_note:nn{metadatacheck}{addressok}}
    {\msg_warning:nn{metadatacheck}{addresserr}}
}
\NewDocumentCommand{\CheckMetaData}{ } { \check_meta_data: }
\ExplSyntaxOff

User document:

\documentclass{article}
\usepackage{metadatacheck}
\title{my Title}
\author{Alic3 Sm1th}
\date{Christmas '23}
\address{23 Mulholland Drive}
\CheckMetaData

\begin{document}
\maketitle
\end{document}

Result (in terminal and log file):

Package metadatacheck Warning: Title does not start with upper case

Package metadatacheck Warning: Author cannot have numbers

Package metadatacheck Warning: Date does not include 4-digit year

Package metadatacheck Info: Address check OK

Of course this could be extended further by allowing the user to provide the regexes and the messages, either via document commands or maybe with a configuration file.
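As a rough sketch of the "document command" variant of this extension, the following self-contained document defines a command that takes the field, the regex and the failure text as arguments. The command name \AddMetaDataCheck and the message names userok and usererr are made up for this illustration and are not part of the package above. The regexes in the usage lines deliberately avoid backslash escapes such as \A, so that they can be typed outside \ExplSyntaxOn without worrying about how control words are turned into strings.

```latex
\documentclass{article}

\ExplSyntaxOn
% Variant so that the subject can be passed as a token list variable.
\prg_generate_conditional_variant:Nnn \regex_match:nn { nV } { TF }

% #1 is the field name, #2 the user-supplied failure text.
\msg_new:nnn { metadatacheck } { userok }  { Field~'#1':~check~OK }
\msg_new:nnn { metadatacheck } { usererr } { Field~'#1':~#2 }

\tl_new:N \l__metadatacheck_value_tl

% \AddMetaDataCheck{<field>}{<regex>}{<failure text>}  (hypothetical name)
% <field> is the internal macro name without "@", e.g. "title" for \@title.
\NewDocumentCommand { \AddMetaDataCheck } { m m m }
  {
    \tl_set:Nx \l__metadatacheck_value_tl { \use:c { @ #1 } }
    \regex_match:nVTF {#2} \l__metadatacheck_value_tl
      { \msg_note:nnn { metadatacheck } { userok } {#1} }
      { \msg_warning:nnnn { metadatacheck } { usererr } {#1} {#3} }
  }
\ExplSyntaxOff

\title{my Title}
\author{Alice Smith}
\date{Christmas '23}

% The checks run immediately, so they must come after \title and \date.
\AddMetaDataCheck{title}{^[A-Z]}{does not start with upper case}
\AddMetaDataCheck{date}{[0-9]{4}}{does not include a 4-digit year}

\begin{document}
\maketitle
\end{document}
```

With the values shown, both checks are expected to fail and write a warning to the log. A configuration file could simply contain a list of such \AddMetaDataCheck lines and be loaded with \input after the metadata has been set.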

Marijn
  • This is very useful. Might be a bit advanced for those who are not sure what to do with metadata (or what it means). – rallg Jan 07 '24 at 21:52
  • This is very nice and exactly what I had in mind for the embryonic version. I would like to be able to call an external application for validation, to check ranges included in the definition of the metadata entry ....and will try to do some work on the interface. – TeX Apprentice Jan 07 '24 at 22:42

At least one package currently available on CTAN, hyperxmp by Scott Pakin (https://ctan.org/pkg/hyperxmp?lang=en), can do what the OP is requesting and would be a good starting point if there is a need to extend it.
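For orientation, a minimal usage sketch of hyperxmp follows. The values are invented for the example; pdftitle, pdfauthor and pdfkeywords are hyperref keys, while pdfcopyright and pdflicenseurl are added by hyperxmp. As far as I can tell the package records these values in the PDF's XMP packet rather than validating them, so any checks would have to be built on top.

```latex
\documentclass{article}
\usepackage{hyperref}
\usepackage{hyperxmp}

\title{My Title}
\author{Alice Smith}

% Example values only; they are written to the PDF metadata, not typeset.
\hypersetup{
  pdftitle      = {My Title},
  pdfauthor     = {Alice Smith},
  pdfkeywords   = {metadata, validation},
  pdfcopyright  = {Copyright (C) 2024 Alice Smith},
  pdflicenseurl = {https://creativecommons.org/licenses/by/4.0/},
}

\begin{document}
\maketitle
\end{document}
```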

In the comments it was suggested that a specification be written before any development work is done. I also provided a link to the Adobe specification, which I think is a good example. In many cases plain enumerated lists and tables are enough. The most important part, in my opinion, is the scope of the program. In the context of the question, the specification has to answer which software is expected to read this information, given that metadata are interfaces for the exchange of information between different programs. Once the specification is written, the validation functions can be developed. Care should be taken to cover different writing systems and encodings, as necessary.

An example of date validation conforming to LaTeX package requirements can be found in the l3doc code listing, written using the l3 programming layer. A marvel of l3 programming acrobatics!
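To give a flavour of what such a routine involves, here is a small sketch of my own, not the l3doc code, that checks a yyyy-mm-dd date including the leap-year rule. The names \CheckIsoDate and the mdsketch prefix are invented for the example, and it assumes the Gregorian calendar and years from 0001 upwards.

```latex
\documentclass{article}

\ExplSyntaxOn
\seq_new:N \l__mdsketch_parts_seq
\int_new:N \l__mdsketch_year_int
\int_new:N \l__mdsketch_month_int
\int_new:N \l__mdsketch_day_int

\msg_new:nnn { mdsketch } { date-ok }  { '#1'~is~a~valid~date }
\msg_new:nnn { mdsketch } { date-err } { '#1'~is~not~a~valid~yyyy-mm-dd~date }

% Expands to the number of days in month #2 of year #1 (Gregorian rules).
\cs_new:Npn \__mdsketch_days_in_month:nn #1#2
  {
    \int_case:nnF {#2}
      {
        { 2 }
          {
            \bool_lazy_or:nnTF
              { \int_compare_p:nNn { \int_mod:nn {#1} { 400 } } = { 0 } }
              {
                \bool_lazy_and_p:nn
                  { \int_compare_p:nNn { \int_mod:nn {#1} { 4 } } = { 0 } }
                  { \int_compare_p:nNn { \int_mod:nn {#1} { 100 } } > { 0 } }
              }
              { 29 } { 28 }
          }
        { 4 } { 30 }  { 6 } { 30 }  { 9 } { 30 }  { 11 } { 30 }
      }
      { 31 }
  }

% True if #1 has the shape yyyy-mm-dd and names an existing calendar day.
\prg_new_protected_conditional:Npnn \mdsketch_if_iso_date:n #1 { TF }
  {
    \regex_extract_once:nnNTF { \A (\d{4}) - (\d{2}) - (\d{2}) \Z } {#1}
      \l__mdsketch_parts_seq
      {
        % item 1 is the whole match, items 2 to 4 are the groups
        \int_set:Nn \l__mdsketch_year_int  { \seq_item:Nn \l__mdsketch_parts_seq { 2 } }
        \int_set:Nn \l__mdsketch_month_int { \seq_item:Nn \l__mdsketch_parts_seq { 3 } }
        \int_set:Nn \l__mdsketch_day_int   { \seq_item:Nn \l__mdsketch_parts_seq { 4 } }
        \bool_lazy_all:nTF
          {
            { \int_compare_p:nNn \l__mdsketch_year_int  > { 0 } }
            { \int_compare_p:nNn \l__mdsketch_month_int > { 0 } }
            { \int_compare_p:nNn \l__mdsketch_month_int < { 13 } }
            { \int_compare_p:nNn \l__mdsketch_day_int   > { 0 } }
            {
              \int_compare_p:nNn \l__mdsketch_day_int <
                {
                  \__mdsketch_days_in_month:nn
                    { \l__mdsketch_year_int } { \l__mdsketch_month_int } + 1
                }
            }
          }
          { \prg_return_true: } { \prg_return_false: }
      }
      { \prg_return_false: }
  }

\NewDocumentCommand { \CheckIsoDate } { m }
  {
    \mdsketch_if_iso_date:nTF {#1}
      { \msg_note:nnn    { mdsketch } { date-ok }  {#1} }
      { \msg_warning:nnn { mdsketch } { date-err } {#1} }
  }
\ExplSyntaxOff

\CheckIsoDate{2024-02-29}     % 2024 is a leap year
\CheckIsoDate{2023-02-29}     % 2023 is not
\CheckIsoDate{Christmas 2023} % wrong shape

\begin{document}
Dates checked; see the log file.
\end{document}
```

The first call is expected to produce a note in the log and the other two a warning. Other input formats (dd/mm/yyyy and so on) would need their own regex and group order.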

A word of caution: web development went through similar stages, but such ontologies failed to really catch on as search became better, and now with LLMs they might fade altogether. This is just my opinion and is not meant to discourage you.

yannisl
  • Thanks for laying out some of the important issues at play here, especially on the further use of this metadata after validation and document production. Would you be able to point to where the date-validation routine of l3doc is? – TeX Apprentice Jan 08 '24 at 11:44
  • @TeXApprentice https://github.com/latex3/latex3/blob/main/l3kernel/l3doc.dtx around line 1349 – yannisl Jan 08 '24 at 12:09

This is not exactly an answer but the specs asked for by Ulrike Fischer in the comments above. It is still in embryonic form and we will edit it as it evolves.

The package metadata will define and allow for use and validation of metadata entries.

To define a new entry use:

\definemetadataentry[type=,
                     format=yyyy-mm-dd,
                     lower_range=0001-01-01,
                     upper_range=9999-12-31,
                     finite_range=,
                     default_value=,
                     optional/required,
                     validation_routine=,
                     hierarchy=
                    ]{name_of_metadata_entry}

where:

type (of data):

Can be:

  • NUMERIC (INTEGER or DECIMAL)
  • CHARACTERSTRING
  • DATETIME

format:

Supply the format of the data type just defined.

lower_range=0001-01-01:

upper_range=9999-12-31:

Defines the lower and upper range of data that is either continuous or too long to be listed in full.

Example of Range for a date set: 0001-01-01 through 9999-12-31

finite_range:

Defines the range of data that can be easily listed in full.

default_value:

The value that a metadata entry defaults to, if not defined.

optional/required:

Defines whether a particular metadata entry is required or optional within the realm that defines it.

validation_routine (external):

A call to an external routine that validates the entry.

hierarchy:

An element defined recursively in terms of another, previously defined metadata entry.

Definitions and Example of use

Types of data:

There are 3 types of data to consider:

  • NUMERIC (INTEGER or DECIMAL)
  • CHARACTERSTRING
  • DATETIME

DATE Format:

  • DATE: Format: YYYY-MM-DD
  • DATETIME: Format: YYYY-MM-DD HH:MI:SS

Format of date can be of different types such as: 'dd-mm-yyyy', 'yyyy-mm-dd', 'mm-dd-yyyy'.

Some of the most used formats, most of them available with a short year (yy) or a long year (yyyy), are:

  • U.S.: mm/dd/yy & mm/dd/yyyy
  • ANSI: yy.mm.dd & yyyy.mm.dd
  • British/French: dd/mm/yy & dd/mm/yyyy
  • German: dd.mm.yy & dd.mm.yyyy
  • Japan: yy/mm/dd & yyyy/mm/dd
  • ISO: yymmdd & yyyymmdd
  • Europe default + milliseconds: dd mon yyyy hh:mi:ss:mmm (24h)
  • Hijri: dd/mm/yyyy hh:mi:ss:mmmAM

\RequirePackage{metadata}
\definemetadataentry[type=DATETIME,
                     format=yyyy-mm-dd,
                     lower_range=0001-01-01,
                     upper_range=9999-12-31,
                     default_value=2000-12-31,
                     optional,
                     validation_routine=/usr/local/texlive2024/bin/abc.lua,
                    ]{date_of_publication}

\definemetadataentry[type=DECIMAL, format=(8,6), ]{latitude}

\definemetadataentry[type=DECIMAL, format=(9,6), ]{longitude}

\definemetadataentry[type=CHARACTERSTRING]{email}

\definemetadataentry[type=CHARACTERSTRING]{url}

\definemetadataentry[type=NUMERIC, format=integer, lower_range=1, upper_range=9999, ]{volume_number}

\definemetadataentry[type=CHARACTERSTRING(2), finite_range={US,MX,CA,GT,HT,CU,HN,...}, ]{north_american_country}

Hierarchy of metadata:

\definemetadataentry[type=CHARACTERSTRING(2),
                     required,
                    ]{country}

\definemetadataentry[type=CHARACTERSTRING(10), hierarchy=country, ]{zipcode}

Example of use:

\zipcode[BR]{22430-085}
\zipcode[US]{91106-3840}
\zipcode[CA]{K1A 0T6}
\zipcode[DE]{13057}
\zipcode[IR]{81599-95950}

\definemetadataentry[type=CHARACTERSTRING(4),
                     required,
                    ]{decade}

\definemetadataentry[type=CHARACTERSTRING(10), hierarchy=decade, ]{msc}

Example of use:

\msc[2010]{76B75}
\msc[2020]{76D55}

\definemetadataentry[type=CHARACTERSTRING(100),
                     required,
                    ]{full_name}

\definemetadataentry[type=CHARACTERSTRING(30), hierarchy=full_name, ]{last_name}

Example of use:

\full_name{John Ewing}
\last_name{Ewing}
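
To test whether the interface is workable, here is a very reduced sketch of how \definemetadataentry could be implemented with l3keys. It is only an assumption about one possible implementation: it handles just the keys type, lower_range, upper_range and default_value, it validates non-negative integers only, and it uses two helper commands, \setmetadata and \getmetadata, which are invented for this sketch because a name like \volume_number is not a valid control sequence name under normal category codes.

```latex
\documentclass{article}

\ExplSyntaxOn
% Only a subset of the proposed keys; any other key raises an l3keys error.
\keys_define:nn { mdsketch / entry }
  {
    type          .tl_set:N = \l__mdsketch_type_tl ,    % stored, not used here
    lower_range   .tl_set:N = \l__mdsketch_lower_tl ,
    upper_range   .tl_set:N = \l__mdsketch_upper_tl ,
    default_value .tl_set:N = \l__mdsketch_default_tl ,
  }

\msg_new:nnn { mdsketch } { out-of-range }
  { Value~'#2'~rejected~for~metadata~entry~'#1' }
\msg_new:nnn { mdsketch } { unknown-entry }
  { Metadata~entry~'#1'~is~not~defined }

% \definemetadataentry[<keys>]{<name>}: one property list per entry.
\NewDocumentCommand { \definemetadataentry } { O{} m }
  {
    \tl_set:Nn \l__mdsketch_lower_tl { 0 }
    \tl_set:Nn \l__mdsketch_upper_tl { \c_max_int }
    \tl_clear:N \l__mdsketch_default_tl
    \keys_set:nn { mdsketch / entry } {#1}
    \prop_gclear_new:c { g__mdsketch_#2_prop }
    \prop_gput:cnV { g__mdsketch_#2_prop } { lower } \l__mdsketch_lower_tl
    \prop_gput:cnV { g__mdsketch_#2_prop } { upper } \l__mdsketch_upper_tl
    \prop_gput:cnV { g__mdsketch_#2_prop } { value } \l__mdsketch_default_tl
  }

% True if #2 is a non-negative integer within the range stored for entry #1.
\prg_new_protected_conditional:Npnn \__mdsketch_if_in_range:nn #1#2 { TF }
  {
    \prop_get:cnN { g__mdsketch_#1_prop } { lower } \l__mdsketch_lower_tl
    \prop_get:cnN { g__mdsketch_#1_prop } { upper } \l__mdsketch_upper_tl
    \regex_match:nnTF { \A \d+ \Z } {#2}
      {
        \int_compare:nTF
          { \l__mdsketch_lower_tl <= #2 <= \l__mdsketch_upper_tl }
          { \prg_return_true: } { \prg_return_false: }
      }
      { \prg_return_false: }
  }

% \setmetadata{<name>}{<value>} (hypothetical): store the value if it is valid.
\NewDocumentCommand { \setmetadata } { m m }
  {
    \prop_if_exist:cTF { g__mdsketch_#1_prop }
      {
        \__mdsketch_if_in_range:nnTF {#1} {#2}
          { \prop_gput:cnn { g__mdsketch_#1_prop } { value } {#2} }
          { \msg_warning:nnnn { mdsketch } { out-of-range } {#1} {#2} }
      }
      { \msg_error:nnn { mdsketch } { unknown-entry } {#1} }
  }

% \getmetadata{<name>} (hypothetical): expands to the stored value.
\NewExpandableDocumentCommand { \getmetadata } { m }
  { \prop_item:cn { g__mdsketch_#1_prop } { value } }
\ExplSyntaxOff

\definemetadataentry[type=NUMERIC, lower_range=1, upper_range=9999]{volume_number}
\setmetadata{volume_number}{12}    % accepted
\setmetadata{volume_number}{12345} % out of range, value stays unchanged

\begin{document}
Volume \getmetadata{volume_number}
\end{document}
```

Dates, decimals, finite ranges, external validation routines and hierarchies are all left out; each of them would need its own key handling and its own test.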