
How can I generate both a PDF and a plain-text version from a single .tex file? The requirements for the text version are documented below.

Input

For example, given this .tex file (texlive.net PDF generator online):

\documentclass[11pt,a4paper]{article}

% I doubt this will affect the PDF to text solution,
% but I've included it to make it as similar as possible to my real document.
\usepackage[
  paperheight=11.00in,
  paperwidth=8.50in,
  margin=1.00in,
  top=1.00in,
  left=1.00in,
  bottom=1.00in
]{geometry}

\usepackage[hidelinks]{hyperref}

% This part is \input from another file.
% Included inline for your convenience.
\hypersetup{
  pdfinfo={
    Author={tfstwbbnb},
  }
}

\newcommand{\authorName}{tfstwbbnb}

% xelatex required, pdflatex does not work
% \setmainfont{Ubuntu Light}[
%   ItalicFont=Ubuntu Light Italic,
%   BoldFont=Ubuntu,
%   BoldItalicFont=Ubuntu Italic,
% ]

\setlength\parindent{0pt}
\pagenumbering{gobble}

\usepackage{xcolor}
\newcommand{\gray}[1]{\textcolor{gray}{#1}}

\usepackage{setspace}
\setstretch{1.10}

% https://tex.stackexchange.com/a/50510
\newcommand{\fitline}[1]{\makebox[\linewidth][s]{#1}}

\newcommand{\myInnerSpacing}{0.40\baselineskip}

\hypersetup{
  pdfinfo={
    Title={tfstwbbnb demo},
  }
}

\newcommand{\optionalOne}{optionalOne}
\newcommand{\optionalTwo}{optionalTwo}
% Links should appear as link text ("requiredOne") in text version.
\newcommand{\requiredOne}{\href{mailto:invalid@example.com}{requiredOne}}
\newcommand{\requiredTwo}{requiredTwo}

\begin{document}
% Alignment in text version does not matter to me. Can be left-justified or centered.
\begin{center}
\LARGE{\textbf{Title}}
\end{center}

\vspace{\myInnerSpacing}

\optionalOne \\
% Optionals might be commented out like so:
% \optionalTwo \\
\requiredOne \\
\gray{\requiredTwo} \\

\vspace{\myInnerSpacing}

% Text formatting should be stripped in text version.
\textbf{Lorem ipsum dolor sit amet}, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \\

Purus semper eget duis at tellus at. Tellus cras adipiscing enim eu turpis egestas pretium aenean. \\

Felis donec, \\
tfstwbbnb

\end{document}


Expected

How can I have it output (as plain text):

Title

optionalOne
requiredOne
requiredTwo

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Purus semper eget duis at tellus at. Tellus cras adipiscing enim eu turpis egestas pretium aenean.

Felis donec,
tfstwbbnb

Copy and Paste

Opening up the PDF in a viewer and copy/paste gives:

TitleoptionalOnerequiredOnerequiredTwoLorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt utlabore et dolore magna aliqua.Purus semper eget duis at tellus at. Tellus cras adipiscing enim eu turpis egestas pretium aenean.Felis donec,tfstwbbnb

pdftotext

pdftotext gives better results, but still not as I want (missing newlines, too many newlines, extra 0x0c character at end):

Title
optionalOne
requiredOne
requiredTwo
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua.
Purus semper eget duis at tellus at. Tellus cras adipiscing enim eu turpis egestas pretium aenean.
Felis donec,
tfstwbbnb

PDF to text summary

Essentially, the plaintext output should have:

  • All coloring (\textcolor{...}) ignored
  • All font sizing (\LARGE, \small) ignored
  • All links (such as requiredOne) displayed as their link text
  • All explicit newlines kept (for example, between the optionalOne and requiredOne)
  • All paragraphs kept (for example, between the Lorem ipsum ... and Purus semper ...)
tfstwbbnb

2 Answers


You can add visible paragraph markers, then remove them from the extracted text:

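# Compile with a hook that typesets a line containing PARA after every paragraph
pdflatex '\AddToHook{para/after}{\hbox{PARA}}\input' cc873
# Extract the text; pdftotext writes cc873.txt
pdftotext cc873.pdf
# Turn each PARA marker into a blank line and strip the form feed (^L)
sed -i -e 's/^PARA//' -e 's/[\f]//' cc873.txt
cat cc873.txt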

Produces

Title

optionalOne requiredOne requiredTwo

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Purus semper eget duis at tellus at. Tellus cras adipiscing enim eu turpis egestas pretium aenean.

Felis donec, tfstwbbnb

(You could remove trailing blank lines with sed as well if they are an issue; I just remove the ^L here.)
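Since the goal is a single .tex file, the hook could also live in the document itself behind a switch. This is a minimal sketch, assuming a hypothetical flag \textbuild (not part of the original file) that is defined on the command line only for the text build. The block is assumed to sit in the preamble after the geometry package has been loaded. The 100in width is an arbitrary value, kept below TeX's maximum dimension of roughly 226in, and is meant to stop paragraphs from being wrapped:

```latex
% Preamble sketch; \textbuild is a hypothetical flag, not in the original file.
% PDF build:   pdflatex cc873
% Text build:  pdflatex -jobname=cc873-text '\def\textbuild{}\input{cc873}'
%              (-jobname keeps the regular cc873.pdf from being overwritten)
\ifdefined\textbuild
  % Typeset a marker line after every paragraph so the breaks survive pdftotext.
  \AddToHook{para/after}{\hbox{PARA}}
  % Assumption: a very wide page keeps each paragraph on one line.
  \geometry{paperwidth=100in}
\fi
```

With this in place, a plain pdflatex run leaves the PDF as before, and the pdftotext and sed steps above would be applied to cc873-text.pdf and cc873-text.txt instead.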

David Carlisle
  • Your result definitely looks much better than mine, however, there is still an extra newline after "ut" in the first paragraph. Is there a way to fix this too? I'm assuming it would be AddToHook{something}, but I don't know where to look for that "something". – tfstwbbnb Nov 04 '22 at 20:37
  • Found https://ctan.math.illinois.edu/macros/latex/base/lthooks-doc.pdf#page=4, but no hook names. – tfstwbbnb Nov 04 '22 at 20:44
  • @tfstwbbnb texdoc ltpara-doc – David Carlisle Nov 04 '22 at 21:22
  • @tfstwbbnb you could redefine \\ to be, say, EOL\newline, although it's harder than it should be due to the misused \\ in the source, which are an error: Underfull \hbox (badness 10000) in paragraph at lines 76--77 – David Carlisle Nov 04 '22 at 21:29
  • Can you expand on how redefining \\ to EOL\newline would help with the "ut" in the middle of a paragraph being split with a newline? Anyway, this particular problem can be solved with paperwidth=1000in. I will accept your solution shortly. – tfstwbbnb Nov 04 '22 at 22:09
  • @tfstwbbnb actually I misread your comment – David Carlisle Nov 04 '22 at 22:10

Pandoc gets it almost right:

pandoc yourfile.tex -f latex -t plain -o yourfile.txt --wrap=none

results in

Title

optionalOne requiredOne requiredTwo

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Purus semper eget duis at tellus at. Tellus cras adipiscing enim eu turpis egestas pretium aenean. Felis donec, tfstwbbnb

The only issue is the missing empty lines between the Lorem ipsum paragraphs. However, the combination of \\ followed by an empty line is not very clean LaTeX, so I understand why Pandoc refuses to convert that :)

If, instead, the empty lines between paragraphs are created with \usepackage[parfill]{parskip}, and no \\ is used at the end of a paragraph, then Pandoc converts the empty lines perfectly. LaTeX code:

\documentclass[11pt,a4paper]{article}

% I doubt this will affect the PDF to text solution,
% but I've included it to make it as similar as possible to my real document.
\usepackage[
  paperheight=11.00in,
  paperwidth=8.50in,
  margin=1.00in,
  top=1.00in,
  left=1.00in,
  bottom=1.00in
]{geometry}

\usepackage[hidelinks]{hyperref}
\usepackage[parfill]{parskip}

% This part is \input from another file.
% Included inline for your convenience.
\hypersetup{
  pdfinfo={
    Author={tfstwbbnb},
  }
}

\newcommand{\authorName}{tfstwbbnb}

% xelatex required, pdflatex does not work
% \setmainfont{Ubuntu Light}[
%   ItalicFont=Ubuntu Light Italic,
%   BoldFont=Ubuntu,
%   BoldItalicFont=Ubuntu Italic,
% ]

\setlength\parindent{0pt}
\pagenumbering{gobble}

\usepackage{xcolor}
\newcommand{\gray}[1]{\textcolor{gray}{#1}}

\usepackage{setspace}
\setstretch{1.10}

% https://tex.stackexchange.com/a/50510
\newcommand{\fitline}[1]{\makebox[\linewidth][s]{#1}}

\newcommand{\myInnerSpacing}{0.40\baselineskip}

\hypersetup{
  pdfinfo={
    Title={tfstwbbnb demo},
  }
}

\newcommand{\optionalOne}{optionalOne}
\newcommand{\optionalTwo}{optionalTwo}
% Links should appear as link text ("requiredOne") in text version.
\newcommand{\requiredOne}{\href{mailto:invalid@example.com}{requiredOne}}
\newcommand{\requiredTwo}{requiredTwo}

\begin{document}
% Alignment in text version does not matter to me. Can be left-justified or centered.
\begin{center}
\LARGE{\textbf{Title}}
\end{center}

\optionalOne \\
% Optionals might be commented out like so:
% \optionalTwo \\
\requiredOne \\
\gray{\requiredTwo}

% Text formatting should be stripped in text version.
\textbf{Lorem ipsum dolor sit amet}, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Purus semper eget duis at tellus at. Tellus cras adipiscing enim eu turpis egestas pretium aenean.

Felis donec, \\
tfstwbbnb

\end{document}

The PDF created from this code is almost the same as the one in the question; the difference is that the gaps between paragraphs are a bit smaller. If you want the big gaps, then use

\usepackage[skip=\baselineskip]{parskip}

instead.

Marijn