
How can I make a macro in LaTeX similar to ConTeXt's \averagecharwidth, which calculates the average width of a character based on the frequency of each character in my document? That macro is shown in this post.


Aurelius
  • This looks like a tough challenge! The hard part is to count the frequency of each character in your document. Probably doable using LuaLaTeX, registering some callback to count each character as it reaches TeX's stomach. – Bruno Le Floch Aug 21 '12 at 19:56

2 Answers


The macros are fairly low-level TeX, so they are easy to use in LaTeX once a few missing definitions are added. With these definitions in place, you can simply input lang-frq.mkii, lang-frd.mkii, and the helper file supp-mis.mkii (on the destination page, click raw to download) and use ConTeXt's \averagecharwidth directly.

% Copy definition of \emptybox from supp-box.mkii
\ifx\voidbox\undefined      \newbox\voidbox \fi
\def\emptybox{\box\voidbox}

% Copy definition of \startnointerference from syst-new.mkii
\newbox\nointerferencebox

\def\startnointerference
  {\setbox\nointerferencebox\vbox
   \bgroup}

\def\stopnointerference
  {\egroup
   \setbox\nointerferencebox\emptybox}

% Load a trimmed down version of ConTeXt macros
\input supp-mis.mkii

\input lang-frq.mkii 
\input lang-frd.mkii

% Set the main language. (I don't know the LaTeX equivalent of
% \currentmainlanguage.)
\def\currentmainlanguage{en}

\documentclass{article}
\begin{document}
The average character width is \the\averagecharwidth
\end{document}

NOTE: Comment out line 116 of lang-frd.mkii (the one that reads \startcharactertable[en] 100 x \stopcharactertable % kind of default).

Aditya
  • The last link above points to the incorrect page if I understand correctly. It's quite possible that I'm missing something small, but I can't get the above to compile after downloading (presumably the correct) files. Would someone mind giving me a hand? – Scott H. Aug 21 '12 at 23:18
  • @Scott: Can you try again after commenting out one line of lang-frd.mkii (see updated answer)? – Aditya Aug 22 '12 at 00:27
  • Thanks Aditya, that led me to what was actually going wrong...I had mindlessly right clicked and saved as so I had a bunch of xml files which weren't going to work no matter what I did: PEBKAC – Scott H. Aug 22 '12 at 01:03

Here's a naive approach.

  1. Store the entire document in a token list
  2. Count the number of occurrences of each alphabetic character (mostly)
  3. Divide each character count by total number of alphabetic characters to get the relative frequency of that character.
  4. Multiply that ratio by the width of the character and sum to get average character width.
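In symbols, steps 3 and 4 amount to a weighted average. The notation here ($A$ for the set of counted characters, $n_c$ for the count of character $c$, $N$ for the total count, $w_c$ for the width of $c$) is introduced only for this illustration:

```latex
\[
  \bar{w} \;=\; \sum_{c \in A} \frac{n_c}{N}\, w_c ,
  \qquad
  N \;=\; \sum_{c \in A} n_c .
\]
```

The recommended line width is then $66\,\bar{w}$.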

Some notes:

  • Brace groups are counted as a single token, so things like \begin{environment} and \par won't match any alphabetic characters. This is an advantage.
  • At the same time, the words within \text{some text} won't get counted. This is a disadvantage.
  • Capital letters can be taken into account, but it is slow.
  • I don't think that I missed anything significant, but you never know.
  • Edit: Spaces are now included in the calculation, and the effect of the macro is cumulative. In dealing with spaces, I made the assumption that stretching and shrinking cancel one another in the long run and that the average width of a space is just the normal width of a space. Someone please let me know if there's a better way to deal with that.
  • Edit: Compile twice to automatically adjust textwidth to desired value.
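Regarding the space-width assumption in the notes above: TeX keeps the natural interword space of the current font in \fontdimen2, with its stretch and shrink components in \fontdimen3 and \fontdimen4. A minimal sketch for inspecting these values follows; treating the natural width as the long-run average remains an assumption, which this snippet does not verify:

```latex
\documentclass{article}
\begin{document}
% \fontdimen2: natural width of the interword space in the current font
Natural interword space: \the\fontdimen2\font\par
% \fontdimen3 and \fontdimen4: how much that space may stretch or shrink
Stretch: \the\fontdimen3\font\par
Shrink: \the\fontdimen4\font
\end{document}
```

If the stretch is noticeably larger than the shrink for a given font, the real average space in justified text may be somewhat wider than the natural width.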

Anyway, for straight text, this gives an exact average character width. The result becomes less accurate if more printed text is hidden in brace groups.

\documentclass{article}
\usepackage{xparse}
\usepackage{siunitx}
\usepackage{booktabs}
\usepackage{environ}
\usepackage{longtable} % needed for the longtable environment in \showtable

\ExplSyntaxOn

\bool_new:N \g_has_run_bool
\tl_new:N \l_aw_text_tl
\int_new:N \l_aw_tot_int
\int_new:N \g_aw_tot_alph_int
\fp_new:N \g_wid_space_fp
\int_new:N \g_space_int
\fp_new:N \g_rat_space_fp
\fp_new:N \g_aw_avg_width_fp
\dim_new:N \myalphabetwidth
\dim_new:N \mytextwidth
\input{testing.aux}
\tl_const:Nx \c_aw_the_alphabet_tl {abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ,.;?()!' \token_to_str:N :}

% this can be changed to an environment or renamed or whatever
\NewDocumentCommand {\avgwidthstart} {}
  {
    \aw_avg_width:w
  }

\NewDocumentCommand {\avgwidthend}{}{}

% Here is the environment version, using just "text" as a name is probably a bad idea.
\NewEnviron{awtext}
{
  \expandafter\avgwidthstart\BODY\avgwidthend
}

\makeatletter

\cs_new:Npn \aw_avg_width:w #1 \avgwidthend
  {
    % if first run, then generate variables to be used
    \bool_if:NF \g_has_run_bool
      {
        \tl_map_inline:Nn \c_aw_the_alphabet_tl
        {
          \int_new:c {g_##1_int}
          \fp_new:c {g_rat_##1_fp}
          \fp_new:c {g_wid_##1_fp}
        }
      }
    \tl_set:Nn \l_aw_text_tl {#1}

    % this can be used rather than the preceding line to take capital 
    % letters into account, but is Slooooooow
    %\tl_set:Nx \l_aw_text_tl {\tl_expandable_lowercase:n {#1}}

    \int_set:Nn \l_aw_tot_int {\tl_count:N \l_aw_text_tl}
    \tl_map_function:NN \c_aw_the_alphabet_tl \aw_get_counts:n
    \deal_with_spaces:n {#1}
    \tl_map_function:NN \c_aw_the_alphabet_tl \aw_calc_ratios:n
    \tl_map_function:NN \c_aw_the_alphabet_tl \aw_calc_avg_width:n
    \fp_gset_eq:NN \g_aw_avg_width_fp \l_tmpa_fp
    \fp_zero:N \l_tmpa_fp

    % the dimension \myalphabetwidth gives the width of the alphabet based on your character freq,
    % can be accessed by \the\myalphabetwidth
    \dim_gset:Nn \myalphabetwidth {\fp_to_dim:n {\fp_eval:n {61*\g_aw_avg_width_fp}}}

    % the dimension \mytextwidth gives the recommended \textwidth based on 66 chars per line.
    % can be accessed by \the\mytextwidth
    \dim_gset:Nn \mytextwidth {\fp_to_dim:n {\fp_eval:n {66*\g_aw_avg_width_fp}}}
    \protected@write\@mainaux{}{\mytextwidth=\the\mytextwidth}
    \bool_gset_true:N \g_has_run_bool

    % and lastly print the content
    #1
  }

\makeatother

\cs_new:Npn \aw_get_counts:n #1
  {
    % make a temporary token list from the document body 
    \tl_set_eq:NN \l_tmpb_tl \l_aw_text_tl
    % remove all occurrences of the character
    \tl_remove_all:Nn \l_tmpb_tl {#1}
    % count the occurrences of that character in the current block
    \int_set:Nn \l_tmpa_int {\int_eval:n{\l_aw_tot_int -\tl_count:N \l_tmpb_tl}}
    % add to appropriate int the number of occurrences of that character in current block
    \int_gadd:cn {g_#1_int} {\l_tmpa_int}
    % add this to the total
    \int_gadd:Nn \g_aw_tot_alph_int {\l_tmpa_int}
  }

\cs_new:Npn \deal_with_spaces:n #1
  {
    \tl_set:Nn \l_tmpa_tl {#1}
    % rescan body with spaces as characters
    \tl_set_rescan:Nnn \l_tmpb_tl {\char_set_catcode_letter:N \ }{#1}
    % find number of new characters introduced.  add to number of spaces and alph chars
    \int_set:Nn \l_tmpa_int {\tl_count:N \l_tmpb_tl -\tl_count:N \l_tmpa_tl}
    \int_gadd:Nn \g_space_int {\l_tmpa_int}
    \int_gadd:Nn \g_aw_tot_alph_int {\l_tmpa_int}
    % since this comes after the rest of chars are dealt with, tot_alph is final total
    \fp_gset:Nn \g_rat_space_fp {\g_space_int/\g_aw_tot_alph_int}
    % get width of space and use it.  obviously space is stretchable, so i'll assume
    % that the expansions and contractions cancel one another over large text.  is this
    % a terrible assumption???
    \hbox_set:Nn \l_tmpa_box {\ }
    \fp_gset:Nn \g_wid_space_fp {\dim_to_fp:n {\box_wd:N \l_tmpa_box}}
    \fp_add:Nn \l_tmpa_fp {\g_wid_space_fp*\g_rat_space_fp}
  }

\cs_new:Npn \aw_calc_ratios:n #1
  {
    % divide number of occurrences of char by total alphabetic chars
    \fp_gset:cn {g_rat_#1_fp}{{\int_use:c {g_#1_int}}/\g_aw_tot_alph_int}
  }

\cs_new:Npn \aw_calc_avg_width:n #1
  {
    % only need to find char widths once
    \bool_if:NF \g_has_run_bool
      {
        % find width of char box
        \hbox_set:Nn \l_tmpa_box {#1}
        \fp_gset:cn {g_wid_#1_fp}{\dim_to_fp:n {\box_wd:N \l_tmpa_box}}
      }
    % multiply it by char frequency and add to avg width
    \fp_add:Nn \l_tmpa_fp {{\fp_use:c {g_wid_#1_fp}}*{\fp_use:c {g_rat_#1_fp}}}
  }
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This part is just for fun. Delete it and the showtable command from the document if
% it isn't wanted
\tl_new:N \l_aw_tab_rows_tl
\seq_new:N \g_aw_the_alphabet_seq

\NewDocumentCommand {\showtable}{}
    {
      \clearpage
      \aw_make_table:
    }

\cs_generate_variant:Nn \seq_set_split:Nnn {NnV}
\cs_new:Npn \aw_make_table:
    {
      \thispagestyle{empty}
      \seq_set_split:NnV \g_aw_the_alphabet_seq {} \c_aw_the_alphabet_tl
      \seq_map_function:NN \g_aw_the_alphabet_seq \aw_generate_row:n
      \begin{table}
      \centering
      \sisetup{round-mode = places,round-precision = 5,output-decimal-marker={,},table-format = 3.5}
      \begin{tabular}{lll}
        \toprule
        {Average\,text\,width}&{Average\,character\,width}&{Average\,alphabet\,width}\\
        \midrule
        \the\mytextwidth&\fp_eval:n {round(\g_aw_avg_width_fp,5)}pt&\the\myalphabetwidth\\
        \bottomrule
      \end{tabular}\par
      \end{table}
      \vfil
      \centering
      \sisetup{round-mode = places,round-precision = 5,output-decimal-marker={,},table-format = 3.5}
      \begin{longtable}{cS}
        \toprule
        {Letter}&{Actual}\\
        \midrule
        spaces&\fp_eval:n {\g_rat_space_fp*100}\%\\
        \tl_use:N \l_aw_tab_rows_tl
        \bottomrule
      \end{longtable}\par
    }

\cs_new:Npn \aw_generate_row:n #1
    {
      \tl_put_right:Nn \l_aw_tab_rows_tl {#1&}
      \tl_put_right:Nx \l_aw_tab_rows_tl {\fp_eval:n {100*{\fp_use:c {g_rat_#1_fp}}}\%}
      \tl_put_right:Nn \l_aw_tab_rows_tl {\\}
    }

\ExplSyntaxOff

    \begin{document}

    \avgwidthstart
    My audit group's Group Manager and his wife have an infant I can describe only as fierce.
    Its expression is fierce; its demeanor is fierce; its gaze over bottle or pacifier or finger-fierce, 
    intimidating, aggressive. I have never heard it cry. When it feeds or sleeps, its pale face reddens,
    which makes it look all the fiercer.
    \avgwidthend

    \avgwidthstart
    On those workdays when our Group Manager, Mr. Yeagle, brought it in to the District office, hanging papoose-style in a nylon device on his back, the infant appeared to 
    be riding him as a mahout does an elephant. It hung there, radiating authority. Its back lay directly 
    against Mr. Yeagle's, its large head resting in the hollow of its father's neck and forcing our Group 
    Manager's head out and down into a posture of classic oppression. They made a creature with two faces,
    one of which was calm and blandly adult and the other unformed and yet emphatically fierce. The infant 
    never wiggled or fussed in the device. Its gaze around the corridor at the rest of us gathered waiting 
    for the morning elevator was level and unblinking and (it seemed) almost accusing. The infant's face, as 
    I experienced it, was mostly eyes and lower lip, its nose a mere pinch, its forehead milky and domed, 
    its pale red hair wispy, no eyebrows or lashes or even eyelids I could see. I never saw it blink. Its 
    features seemed suggestions only. It had roughly as much face as a whale does. I did not like it at all.\par\noindent
    http://harpers.org/media/pdf/dfw/HarpersMagazine-2008-02-0081893.pdf
    \avgwidthend

    \begin{awtext}
    Here is some more text in an environment this time.  This text is included in the calculation of the average width.
    \end{awtext}
    \showtable{}

    \end{document}

Here are the character frequencies for the given text

Explanation: The gist I get from this "average width of a character" thing is the following.

  • People have decided that having ~66 characters per line improves readability of text.
  • Since line width is fixed, the actual number of characters per line depends on which characters are typed, e.g., a line of all m's will contain fewer characters than a line of all i's since an m is wider than an i.
  • Thus, to set a reasonable line width to approximate 66 characters per line, we need to know the relative frequencies of the characters that are used in the document. If most of the characters are wide, then we need wider lines. If most of the characters are narrow then we need correspondingly narrower lines.
  • Therefore, we calculate the average width of the characters that are used and use this to determine line width. For example, if our document consists of m's and i's in an equal ratio (50/50), then "the average character" has width somewhere between the width of an m and that of an i. Specifically, the average character has width x=(wd(m)+wd(i))/2 and we should set our \textwidth to 66*x. Extrapolating to an arbitrary document, we calculate the weighted average of the widths of the characters used according to their relative frequencies within the document, and multiply this by 66 (or use it in whatever way) to get the \textwidth that best accommodates the 66-characters-per-line criterion.
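The m/i example can be checked directly. This is a minimal sketch, assuming the default article font; the length names \wdm, \wdi and \avgwd are made up for this illustration and are not part of the answer's code:

```latex
\documentclass{article}
% Hypothetical helper lengths, used only in this example
\newlength{\wdm}
\newlength{\wdi}
\newlength{\avgwd}
\begin{document}
% Measure the two characters in the current font
\settowidth{\wdm}{m}
\settowidth{\wdi}{i}
% Equal frequencies (50/50), so the weighted average is the plain mean
\setlength{\avgwd}{\dimexpr(\wdm+\wdi)/2\relax}
Width of m: \the\wdm; width of i: \the\wdi.\par
Average character width: \the\avgwd.\par
Suggested text width for 66 characters: \the\dimexpr 66\avgwd\relax.
\end{document}
```

For unequal frequencies, each width would be weighted by its relative frequency before summing, which is what the answer's code does for the whole alphabet.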
David Carlisle
Scott H.
  • :O Thanks! What does the value that I get (4.690139307294571pt in the example) mean? Which characters is it the average width of? And how can I hide that value above the text and set it up as a simple length (like \the\textwidth) that shows the total length of the alphabet based on the frequency of the characters in my document? – Aurelius Aug 22 '12 at 12:59
  • The ~4.69pt value is the average width of a character in your document, based on the character frequencies of your document. If I understand correctly, you would like to use this value to set a \textwidth of ~66 characters per line? – Scott H. Aug 22 '12 at 17:17
  • I don't understand: of which character of the alphabet? And right, I want to use it to set the \textwidth. – Aurelius Aug 22 '12 at 17:45
  • That is the width of the average character in the alphabet based on the frequencies in your document. If your document consisted of all i's and j's then the average width of a character will be smaller than if your document consisted of all m's and w's. Maybe I'll edit the question to update and explain. – Scott H. Aug 22 '12 at 17:51
  • @ScottH. You sometimes have several fp operations in a row: you can simplify those and avoid temporary floating point variables by directly using things like \fp_use:c { l_rat_#1_fp }. Also, use \fp_use:c not \fp_eval:c. In general, you should only make c variants of N-type arguments, not n-types, because those typically expect a braced argument rather than a single token. – Bruno Le Floch Aug 22 '12 at 18:56
  • In fact, since \seq_mapthread_function:NNN is expandable, and since you can make the function it maps expandable too (with some work), you could move that to the table body. Actually, I'm now wondering why we don't have a \seq_mapthread_inline:NNn. By the way, any opinion on the name mapthread? – Bruno Le Floch Aug 22 '12 at 19:04
  • @BrunoLeFloch Thanks for the tips! I appreciate you taking the time to make suggestions. I don't know if there's really an obvious name for that function, mapthread seems as good as any. Possibly, map_paired? – Scott H. Aug 22 '12 at 19:12
  • Is it possible to do something like this?

    \begin{document} \avgwidthstart \input{Dolor} \avgwidthend \end{document}

    – Aurelius Aug 22 '12 at 19:15
  • Or put several \avgwidthstart \avgwidthend pairs throughout the document and finally sum the results of all of them, so that you can choose what counts toward the average. – Aurelius Aug 22 '12 at 19:21
  • The first I'm not sure how to do, the second is straightforward with a little fiddling. I can update the answer with your second request if you like? – Scott H. Aug 22 '12 at 19:37
  • Sure, you are too kind! :D Don't you think it would be a good idea to add this feature? – Aurelius Aug 22 '12 at 20:06
  • lol, I probably should have done that from the beginning. Just wanted to make sure you actually wanted it. If not today, then I'll do it tomorrow. – Scott H. Aug 22 '12 at 20:13
  • The \input{Dolor} case that FormlessCloud cites is why I advised to do it in LuaTeX: then it would be possible to really count characters which are typeset. – Bruno Le Floch Aug 23 '12 at 11:14
  • But by placing several \avgwidthstart \avgwidthend pairs throughout the document, and thereby selecting the text that you want to count toward the average, you work around that problem anyway, don't you? @Scott "Edit: Very naive, I didn't realize that the 66 characters per line included spaces! Counting spaces seems tough, I'll see if I can include it." Is this problem fixable? – Aurelius Aug 23 '12 at 12:25
  • Spaces are included now. @BrunoLeFloch Unfortunately, I don't know a lick of Lua at the moment, that's something I hope to learn in the future. – Scott H. Aug 23 '12 at 20:14
  • Perfect! Look at the picture in my question; is it possible to make that update? – Aurelius Aug 24 '12 at 20:15
  • \newenvironment{text}{{\aw_avg_width:w}}{} Why does it give errors if I rename the environment? – Aurelius Aug 24 '12 at 20:31
  • Ok: I added the environment, and changed to %'s rather than decimals. Hopefully in my copy pasting I didn't mess anything up! – Scott H. Aug 24 '12 at 21:16
  • I have found one important problem: try setting textwidth=\mytextwidth with the geometry package and see the effect! – Aurelius Aug 24 '12 at 22:06
  • Yes, if you try to do that at the start of the document, \mytextwidth has not yet been defined! Even mid-document, you would first need to make sure that all of the commands collecting text to calculate the average width have been processed before trying to explicitly set it. There are likely other ways but I don't think you can avoid two compilations. Either (1) compile once, determine the width and then explicitly set it for the next compile, or (2) write mytextwidth to the aux file or something at the end of the doc and then process that on the second compile. – Scott H. Aug 24 '12 at 22:44
  • To just compile twice and have the margins set: (1) add \input{jobname.aux} where jobname is the filename of your document, and (2) add \protected@write\@mainaux{}{\textwidth=\the\mytextwidth} after the line \dim_gset:Nn \mytextwidth in the definition of \aw_avg_width. You'll need to put a \makeatletter \makeatother pair around the definition of \aw_avg_width. Actually, I'll just do that. – Scott H. Aug 25 '12 at 01:43
  • see this: http://www65.zippyshare.com/v/8553418/file.html i would simply this configuration – Aurelius Aug 25 '12 at 13:30
  • Change the command in the write to this: \mytextwidth=\the\mytextwidth, and move the \input command to immediately after the *_new commands at the start. Also, you'll need to put \makeatletter and \makeatother around the definition of \aw_avg_width. – Scott H. Aug 25 '12 at 17:03
  • Thanks, I'll try it. One thing: are symbols like : ; , . ! ? etc. included in the count? – Aurelius Aug 25 '12 at 17:33
  • As I understand it, it is impossible to use the command \mytextwidth directly; I must copy the value of that command and insert it into the geometry package command as a value in pt, not as a command. Right? – Aurelius Aug 25 '12 at 18:39
  • With my suggestion above and with the inclusion of the geometry command as in your link above it compiled fine for me. You have to compile twice, once to write the value to the aux file and once to read it from the input. – Scott H. Aug 25 '12 at 19:31
  • Good! I have added one update to the code; check whether it is correct. The calculated width seems to be correct: I checked some typeset lines, and in those lines the number of characters oscillates between 61 and 70. It's a pity that the process is not automatic... – Aurelius Aug 25 '12 at 21:31
  • Is this the final count of characters: Total\,characters\,=\,\fp_eval:n {\g_aw_tot_alph_int}? – Aurelius Aug 28 '12 at 10:00
  • Because if I count the characters with MS Word, the count is different. – Aurelius Aug 28 '12 at 10:06
  • It is the number of tokens that are single alphabetic characters. It depends on what you're pasting into Word and whether you've turned on the capitalization (commented line in code above). Word counts punctuation and counts each space separately whereas TeX treats multiple spaces as one. I don't know how Word deals with indents, possibly it counts those as 4 spaces each. With that taken into account, I get the same character count for a small sentence or two. From that I assume that those things are what is causing the different results. – Scott H. Aug 28 '12 at 15:03
  • I have found one cause: if we add the symbols .,!?:; to \tl_const:Nn \c_aw_the_alphabet_tl {abcdefghijklmnopqrstuvwxyz}, like this: \tl_const:Nn \c_aw_the_alphabet_tl {abcdefghijklmnopqrstuvwxyz.,!?:;}, the final count is more precise. What do you think about this? Is there a more correct way to count those symbols? For example, in a text of 5570 characters, LaTeX counts 5301 characters; I think the difference comes from those symbols and also from the capital letters. – Aurelius Aug 28 '12 at 16:42
  • I'm not sure if there's a better way. If you would like to count punctuation as well then that's a good idea! – Scott H. Aug 29 '12 at 05:51
  • Sorry, but I have another request: how can I remove the mechanism that gives the theoretical frequencies and the difference between the values, and leave only the letters and the corresponding frequencies in the table? Is it also possible to modify the awtext environment so that it can be used without the \avgwidthstart\avgwidthend pair, so that this pair of commands can be dropped? Thanks. – Aurelius Sep 16 '12 at 20:07
  • @ScottH Good answer but does this take into account the size (small, huge, ...) and scaling? –  Sep 16 '12 at 21:22
  • @MarcvanDongen Unfortunately, it doesn't take any of those into account. I guess that's the primary limitation of this approach. As Bruno suggested above, probably to get the best approximation (with a reasonable amount of effort) Lua would need to be involved. I think that the methods in the solution http://tex.stackexchange.com/a/58327/14100 could be used to get a better approximation. – Scott H. Sep 17 '12 at 02:51
  • @FormlessCloud The table could be modified by deleting rows 3 and 4 from the generate row macro, removing references to #2, changing mapthread_function to map_function and removing the theor_rats sequence. I'm not at a computer right now however. I'm not sure I understand why you want to remove the start and end macros from the environment? – Scott H. Sep 17 '12 at 03:00
  • Thanks, I have tried your changes but they don't work. I have updated your answer following your indications so that you can see where the problem is, and I have also added a way to take capital letters into account; tell me what you think about it! ;) I would like to use only the environment to analyze the text, but if I use it alone I get an error. I can use the environment only together with the start/end pair; once that is fixed, those commands would no longer be necessary and could be deleted. (I want to remove the theoretical sequence because I also need to write in non-English languages.) – Aurelius Sep 17 '12 at 10:05
  • @MarcvanDongen But in 99% of cases, 99% of the text is written in normal size, so I don't see a real problem, do you? – Aurelius Sep 17 '12 at 10:13
  • @FormlessCloud I fixed your edits so that it should now work. Using the environment exclusively works fine for me, I don't know why it doesn't work for you. – Scott H. Sep 18 '12 at 00:48
  • Thanks! The error that I get if I use the environment exclusively comes from this code inside the longtable environment: \fp_eval:n {\g_rat_space_fp*100}. If I remove it I don't have any problem. Also, I had to add \usepackage{longtable} to compile your answer. Where can I read about this programming language? Only in the xparse PDF? – Aurelius Sep 18 '12 at 10:51
  • Actually, you're right...I don't get the same error, but using the environment exclusively doesn't count spaces correctly. I don't know how to fix that. You can get more information about expl3 here: http://mirror.hmc.edu/ctan/macros/latex/contrib/l3kernel/interface3.pdf – Scott H. Sep 18 '12 at 18:17