
This question is close to my earlier question here, but in this case I want to create an alphanumeric hash string (0-9 and A-Z, capital letters only) of 4 characters (ideally with an adjustable length) that is derived from an input string rather than random.

In this answer I got a great approach for handling this with an MD5 hash. The problem is that this hash consists only of hexadecimal characters (0-9 and A-F). That is not really usable, because I get a lot of duplicates with 3-4 characters. Another problem is that my Bitbucket pipeline cannot handle the expl3 command \str_mdfive_hash:e.

Ideally, the solution would have a built-in duplicate check that throws an error if a duplicate appears. This is not mandatory, though; I can do that myself :)

In the end, the hash should be placed into a table created with pgfplots.

PascalS
  • the md5 sum (which is a pdftex primitive, you don't need expl3) is a 128bit number expressed in hex. so you can split it up into 4 sets of 8 hex digits then apply any function that maps that range to at most 1296=36*36 and express the result in base 36 using 0-9A-Z then you will end up with a string of length exactly 4. – David Carlisle Mar 12 '24 at 16:59
  • What do you mean by not “handling” \str_mdfive_hash:n? – egreg Mar 12 '24 at 17:29
  • Apart from the fact that the code in egreg's answer is much more intelligent than using only the first three characters, all you're missing is \cs_generate_variant:Nn \str_mdfive_hash:n { e } for it to work in your CI, I'd guess. – Skillmon Mar 12 '24 at 17:36
  • Wow, what a fast response! @Skillmon where do I have to add this? Immediately below \NewExpandableDocumentCommand? Sorry for this stupid question, I'm absolutely not familiar with this syntax… – PascalS Mar 12 '24 at 17:52
  • @egreg Could you add this to your answer? – PascalS Mar 12 '24 at 17:52

3 Answers


Five hexadecimal digits yield a number that has at most four base-36 digits, because

16^5 = 1048576 < 1679616 = 36^4

I can't see why \str_mdfive_hash:n would not be available, because it's in the LaTeX kernel.

\documentclass{article}

\ExplSyntaxOn

\NewExpandableDocumentCommand{\myhash}{m}
 {
  \int_to_Base:nn
   {
    \int_from_hex:e { \str_range:enn { \str_mdfive_hash:n { #1 } } { 1 } { 5 } }
   }
   { 36 }
 }
\cs_generate_variant:Nn \str_range:nnn { e }
\cs_generate_variant:Nn \int_from_hex:n { e }

\ExplSyntaxOff

\begin{document}

\ttfamily

\myhash{abcdefghijklmnopqrstuvwxyz}

\myhash{a}

\myhash{bc}

\end{document}
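
The question also asks for an adjustable length. The following is only a minimal sketch under the same assumptions as the code above (a kernel that provides \str_mdfive_hash:n); the command name \myhashn and its optional argument are hypothetical and not part of the original answer. The optional argument is the number of hex digits taken from the MD5 sum. Because TeX integers are limited to 2^31 - 1, at most 7 hex digits can be converted this way (16^7 = 268435456), which gives up to six base-36 characters, since 36^5 = 60466176 < 16^7 < 36^6 = 2176782336. As in the code above, the result is not padded, so it may be shorter than the maximum.

```latex
\documentclass{article}

\ExplSyntaxOn
\cs_generate_variant:Nn \str_range:nnn { e }
\cs_generate_variant:Nn \int_from_hex:n { e }
% \myhashn is a hypothetical name (assumption, not from the answer):
% #1 = number of hex digits to use (1 to 7, default 5), #2 = input string
\NewExpandableDocumentCommand{\myhashn}{O{5}m}
 {
  \int_to_Base:nn
   {
    \int_from_hex:e { \str_range:enn { \str_mdfive_hash:n { #2 } } { 1 } { #1 } }
   }
   { 36 }
 }
\ExplSyntaxOff

\begin{document}

\ttfamily

\myhashn{abc}     % 5 hex digits: at most 4 characters

\myhashn[7]{abc}  % 7 hex digits: at most 6 characters

\end{document}
```

With the default of 5 hex digits this should behave like \myhash; values above 7 would overflow the integer conversion.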


egreg
  • The answer in the linked question was using \str_mdfive_hash:e, because #1 needed to get expanded. Nowadays it's built in, I guess the older versions didn't have the variant, so all that's missing is a \cs_generate_variant:Nn \str_mdfive_hash:n { e }. – Skillmon Mar 12 '24 at 17:34
  • @Skillmon I like your approach to warn when duplicates are found. Is it possible to combine it with this solution? – PascalS Mar 12 '24 at 17:54

The following combines the great answer by @egreg with my older answer to implement this with your pgfplotstable code. All function variants should be generated, even if they are part of the current kernel, so this should work in your CI (and it does no harm if used on current systems). I changed the variable that stores the known hashes from a seq to a clist; since we know the hashes won't contain any commas, this should be a bit faster when searching for duplicates.

\documentclass[10pt,a4paper]{article}
\usepackage{pgfplotstable}
\pgfplotsset{compat=newest}

\ExplSyntaxOn
% generate all variants explicitly; harmless if the kernel already provides them
\cs_generate_variant:Nn \str_set:Nn { Ne }
\cs_generate_variant:Nn \str_mdfive_hash:n { e }
\cs_generate_variant:Nn \str_range:nnn { e }
\cs_generate_variant:Nn \int_from_hex:n { e }
\cs_generate_variant:Nn \msg_error:nnn { nnV }
\cs_generate_variant:Nn \clist_gput_right:Nn { NV }
\prg_generate_conditional_variant:Nnn \clist_if_in:Nn { NV } { TF }

\str_new:N \l__pascals_hash_str
\clist_new:N \g__pascals_hashes_clist
\msg_new:nnn { pascals } { duplicate-hash } { Hash~ #1~ already~ used! }
% first 5 hex digits of the MD5 sum, converted to base 36
\cs_new:Npn \__pascals_calc_hash_aux:n #1
  {
    \int_to_Base:nn
      { \int_from_hex:e { \str_range:enn { \str_mdfive_hash:e {#1} } \c_one_int { 5 } } }
      { 36 }
  }
\cs_new_protected:Npn \__pascals_calc_hash:n #1
  {
    \str_set:Ne \l__pascals_hash_str { \__pascals_calc_hash_aux:n {#1} }
    % raise an error for a known hash, otherwise remember it
    \clist_if_in:NVTF \g__pascals_hashes_clist \l__pascals_hash_str
      { \msg_error:nnV { pascals } { duplicate-hash } \l__pascals_hash_str }
      { \clist_gput_right:NV \g__pascals_hashes_clist \l__pascals_hash_str }
    \pgfkeyslet { /pgfplots/table/create~ col/next~ content } \l__pascals_hash_str
  }
\NewDocumentCommand \clearHashes {} { \clist_gclear:N \g__pascals_hashes_clist }
\NewDocumentCommand \calcHash { m } { \__pascals_calc_hash:n {#1} }
\ExplSyntaxOff

\pgfplotstableread[]{
X Y
1 a
2 b
5 c
}\mydata

\begin{document}

\clearHashes
\pgfplotstablecreatecol[
  create col/assign/.code={%
    \calcHash{\thisrow{X}\thisrow{Y}}%
  }]{ID}{\mydata}
\pgfplotstablegetrowsof{\mydata}
\pgfmathtruncatemacro\myDataRows{\pgfplotsretval-1}

\pgfplotstabletypeset[string type]{\mydata}
\end{document}


A variant that directly uses \pdfmdfivesum instead of \str_mdfive_hash:e:

\documentclass[10pt,a4paper]{article}
\usepackage{pgfplotstable}
\pgfplotsset{compat=newest}

\ExplSyntaxOn
% generate all variants explicitly; harmless if the kernel already provides them
\cs_generate_variant:Nn \str_set:Nn { Ne }
\cs_generate_variant:Nn \str_range:nnn { e }
\cs_generate_variant:Nn \int_from_hex:n { e }
\cs_generate_variant:Nn \msg_error:nnn { nnV }
\cs_generate_variant:Nn \clist_gput_right:Nn { NV }
\prg_generate_conditional_variant:Nnn \clist_if_in:Nn { NV } { TF }

\str_new:N \l__pascals_hash_str
\clist_new:N \g__pascals_hashes_clist
\msg_new:nnn { pascals } { duplicate-hash } { Hash~ #1~ already~ used! }
% first 5 hex digits of the MD5 sum (pdfTeX primitive), converted to base 36
\cs_new:Npn \__pascals_calc_hash_aux:n #1
  {
    \int_to_Base:nn
      { \int_from_hex:e { \str_range:enn { \pdfmdfivesum {#1} } \c_one_int { 5 } } }
      { 36 }
  }
\cs_new_protected:Npn \__pascals_calc_hash:n #1
  {
    \str_set:Ne \l__pascals_hash_str { \__pascals_calc_hash_aux:n {#1} }
    % raise an error for a known hash, otherwise remember it
    \clist_if_in:NVTF \g__pascals_hashes_clist \l__pascals_hash_str
      { \msg_error:nnV { pascals } { duplicate-hash } \l__pascals_hash_str }
      { \clist_gput_right:NV \g__pascals_hashes_clist \l__pascals_hash_str }
    \pgfkeyslet { /pgfplots/table/create~ col/next~ content } \l__pascals_hash_str
  }
\NewDocumentCommand \clearHashes {} { \clist_gclear:N \g__pascals_hashes_clist }
\NewDocumentCommand \calcHash { m } { \__pascals_calc_hash:n {#1} }
\ExplSyntaxOff

\pgfplotstableread[]{
X Y
1 a
2 b
5 c
}\mydata

\begin{document}

\clearHashes
\pgfplotstablecreatecol[
  create col/assign/.code={%
    \calcHash{\thisrow{X}\thisrow{Y}}%
  }]{ID}{\mydata}
\pgfplotstablegetrowsof{\mydata}
\pgfmathtruncatemacro\myDataRows{\pgfplotsretval-1}

\pgfplotstabletypeset[string type]{\mydata}
\end{document}


Yet another variant that will also display leading zeroes.

\documentclass[10pt,a4paper]{article}
\usepackage{pgfplotstable}
\pgfplotsset{compat=newest}

\ExplSyntaxOn
% generate all variants explicitly; harmless if the kernel already provides them
\cs_generate_variant:Nn \str_set:Nn { Ne }
\cs_generate_variant:Nn \str_mdfive_hash:n { e }
\cs_generate_variant:Nn \str_range:nnn { e }
\cs_generate_variant:Nn \int_from_hex:n { e }
\cs_generate_variant:Nn \msg_error:nnn { nnV }
\cs_generate_variant:Nn \clist_gput_right:Nn { NV }
\prg_generate_conditional_variant:Nnn \clist_if_in:Nn { NV } { TF }

\str_new:N \l__pascals_hash_str
\clist_new:N \g__pascals_hashes_clist
\msg_new:nnn { pascals } { duplicate-hash } { Hash~ #1~ already~ used! }
% first 5 hex digits of the MD5 sum, converted to base 36
\cs_new:Npn \__pascals_calc_hash_aux:n #1
  {
    \int_to_Base:nn
      { \int_from_hex:e { \str_range:enn { \str_mdfive_hash:e {#1} } \c_one_int { 5 } } }
      { 36 }
  }
\cs_new_protected:Npn \__pascals_calc_hash:n #1
  {
    \str_set:Ne \l__pascals_hash_str { \__pascals_calc_hash_aux:n {#1} }
    % pad with leading zeroes to exactly 4 characters
    \str_set:Ne \l__pascals_hash_str
      {
        \prg_replicate:nn { 4 - \str_count:N \l__pascals_hash_str } { 0 }
        \l__pascals_hash_str
      }
    % raise an error for a known hash, otherwise remember it
    \clist_if_in:NVTF \g__pascals_hashes_clist \l__pascals_hash_str
      { \msg_error:nnV { pascals } { duplicate-hash } \l__pascals_hash_str }
      { \clist_gput_right:NV \g__pascals_hashes_clist \l__pascals_hash_str }
    \pgfkeyslet { /pgfplots/table/create~ col/next~ content } \l__pascals_hash_str
  }
\NewDocumentCommand \clearHashes {} { \clist_gclear:N \g__pascals_hashes_clist }
\NewDocumentCommand \calcHash { m } { \__pascals_calc_hash:n {#1} }
\ExplSyntaxOff

\pgfplotstableread[]{
X Y
1 a
2 b
5 c
36020001400 BasementFloor
}\mydata

\begin{document}

\clearHashes
\pgfplotstablecreatecol[
  create col/assign/.code={%
    \calcHash{\thisrow{X}\thisrow{Y}}%
  }]{ID}{\mydata}
\pgfplotstablegetrowsof{\mydata}
\pgfmathtruncatemacro\myDataRows{\pgfplotsretval-1}

\pgfplotstabletypeset[string type]{\mydata}
\end{document}
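
As a worked illustration of why a hash can come out with only three characters: a four-character base-36 string needs a value of at least 36^3. Taking the value Q1U reported in the comments for the last table row as the example (with Q = 26 and U = 30), and assuming nothing beyond ordinary base-36 arithmetic:

```latex
\[
  \mathrm{Q1U}_{36} = 26\cdot 36^{2} + 1\cdot 36 + 30 = 33762 < 36^{3} = 46656
\]
```

Since \int_to_Base:nn does not pad its result, such a value is printed with three characters; the \prg_replicate:nn step in the variant above prepends the missing zero, which would give 0Q1U here.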

Skillmon
  • Funny, it works but also gives me an undefined control sequence for \str_mdfive_hash:n in my CI and even in overleaf (standard settings). – PascalS Mar 12 '24 at 20:48
  • And when I add a second character to the columns I get 5 digits instead of 4 – PascalS Mar 12 '24 at 20:56
  • @PascalS I just checked, \str_mdfive_hash:n is indeed rather young, it was added to the kernel on 2023-05-19. Just use \exp_args:Ne \pdfmdfivesum instead of \str_mdfive_hash:e in your code for the CI, as long as you run pdfTeX this should work. – Skillmon Mar 13 '24 at 10:06
  • Overleaf is working now! My CI is compiling now, but throws wrong error: ! Package pascals Error: Hash BCQP already used! Also in overleaf I don‘t see a hash BCQP. – PascalS Mar 13 '24 at 10:39
  • @PascalS if it throws an error, the code says it is duplicate. Nothing I can do about it, sorry. You might want to try bisecting through your data to find the offending line. – Skillmon Mar 13 '24 at 10:59
  • I reduced my data to just 3 lines and it still gives me the error. My suspicion is that some of the other commands are misinterpreted by the compiler. I think so because, before I changed your lines regarding the \str_mdfive_hash:e function, Overleaf calculated completely different hash sums than it does now, which should not be the case for the same input data. Apart from that, the generated hashes were very similar (just 1 digit was different). So the whole calculation seems to be corrupted in these cases. – PascalS Mar 13 '24 at 12:31
  • @PascalS please compare the code you use to the one posted above (I added a variant directly calling \pdfmdfivesum). Also make sure you're running pdfTeX. – Skillmon Mar 13 '24 at 14:59
  • Now it compiles and works! But sometimes it produces just 3 digit hashes! For X=36020001400 and Y=BasementFloor I get just Q1U – PascalS Mar 13 '24 at 15:36
  • Can you reproduce this behaviour or is it related to my build environment? Overleaf shows me the same result without any errors. – PascalS Mar 13 '24 at 15:41
  • @PascalS that's just a leading zero not being displayed. – Skillmon Mar 13 '24 at 17:57
  • @PascalS I've posted yet another version, this one will display leading zeroes, so you always get exactly 4 symbols. – Skillmon Mar 13 '24 at 18:04
  • Thank you so much for your effort! It works now :) – PascalS Mar 13 '24 at 19:02

a plain (e)tex version


\def\foo#1{\immediate\write20{%
  \expandafter\fooa\pdfmdfivesum{#1}%
  \foob\foob\foob\foob\foob\foob\foob\foob
}}

\def\fooa#1#2#3#4#5#6#7#8{%
  \ifx\foob#1\else
    \basethirtysix{\hex{#1}+\hex{#2}+\three{#3}}%
    \expandafter\fooa\fi
}
\def\foob{}

\def\hex#1{%
  \if#1A10\else
  \if#1B11\else
  \if#1C12\else
  \if#1D13\else
  \if#1E14\else
  \if#1F15\else
  #1%
  \fi\fi\fi\fi\fi\fi
}

%16+16=32 so we can add 0,1,2,3 from #3 to use full 0-35 range
\def\three#1{\if#111\else\if#1A2\else\if#1B3\else0\fi\fi\fi}

\def\basethirtysix#1{\ifcase\numexpr0#1\relax
 0\or1\or2\or3\or4\or5\or6\or7\or8\or9\or
 A\or B\or C\or D\or E\or F\or G\or H\or I\or J\or K\or L\or M\or
 N\or O\or P\or Q\or R\or S\or T\or U\or V\or W\or X\or Y\else Z\fi}

\foo{hello world}

\foo{$ \sqrt{x}$}

\foo{some longer text that must be hashed}

\bye

logs

MFCN
GCT9
MLDQ
David Carlisle
  • Interesting what's possible with plain TeX. This example compiles on my end, but I'm struggling to get it working within my pgfplots table (as Skillmon mentioned in his answer) – PascalS Mar 12 '24 at 21:23
  • @PascalS this form just writes to the terminal, remove the \immediate\write20{ so that it puts the hash in the document – David Carlisle Mar 12 '24 at 21:39
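
Following that last comment, here is a sketch of the same plain TeX code with the \immediate\write20{...} wrapper removed, so the hash is typeset in place. Only \foo differs from the answer; the helper macros are repeated so the file is self-contained. It assumes pdfTeX, because \pdfmdfivesum is a pdfTeX primitive.

```latex
% \foo now expands to the four characters instead of writing them to the log
\def\foo#1{%
  \expandafter\fooa\pdfmdfivesum{#1}%
  \foob\foob\foob\foob\foob\foob\foob\foob
}

% helper macros as in the answer: one output character per 8 hex digits
\def\fooa#1#2#3#4#5#6#7#8{%
  \ifx\foob#1\else
    \basethirtysix{\hex{#1}+\hex{#2}+\three{#3}}%
    \expandafter\fooa\fi
}
\def\foob{}

% value of one uppercase hex digit
\def\hex#1{%
  \if#1A10\else
  \if#1B11\else
  \if#1C12\else
  \if#1D13\else
  \if#1E14\else
  \if#1F15\else
  #1%
  \fi\fi\fi\fi\fi\fi
}

% small extra offset (0 to 3) taken from the third hex digit
\def\three#1{\if#111\else\if#1A2\else\if#1B3\else0\fi\fi\fi}

\def\basethirtysix#1{\ifcase\numexpr0#1\relax
 0\or1\or2\or3\or4\or5\or6\or7\or8\or9\or
 A\or B\or C\or D\or E\or F\or G\or H\or I\or J\or K\or L\or M\or
 N\or O\or P\or Q\or R\or S\or T\or U\or V\or W\or X\or Y\else Z\fi}

Hash: \foo{hello world}

\bye
```

For the input hello world this should typeset MFCN, the same value the answer shows in its log. Since \foo works purely by expansion in this form, it could presumably also be used inside an \edef to store the hash in a macro.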