Extracting Text Only from a Thesis

Question

The question "Extracting the contents of text in a specified environment into a new file" (and the answers therein) involves using the extract package to produce a LaTeX file that contains all the text within a specified environment.

My question is whether it is possible to do the reverse. I am forced to work with Microsoft Word users. Due to Word's stability issues we usually maintain one .docx file containing all the text for a paper (or in this case my thesis) and another containing all of the figures and captions. Some journals require this approach as well. The endfloat package gets part of the way there by placing figures at the end of the generated PDF. I've had trouble using TeX4ht or latex2rtf because all of my figures are PDFs. Figures in PDF format work great with pdftex, which I use to generate the PDF files.

Being a chemist, I use a lot of superscripts and subscripts, in particular with the mhchem package. I was hoping that if I could get the text into HTML or RTF, I could import it into Word without having to redo every subscript and superscript in the text. I've tried placing all of the \includegraphics commands within a comment environment (like textonly below) using the comment package:

\begin{figure}
\begin{textonly}
\centering
\includegraphics{figurefile}
\end{textonly}
\label{fig:figure}
\end{figure}

This approach broke all of my references (\ref{fig:figure}) to figures - they typeset as ??.

Is finding a way to compile the LaTeX code into a more Word-friendly format the way to go, or is converting the PDF generated by pdftex the best approach? I find that I have to manually fix the subscripts and superscripts so that Word recognizes them. Additionally, all of the \ref commands work in the final output PDF, whereas that may be a problem in a split-file approach.

There is a PDF-to-Word converter (free trial): http://www.pdftoword.com/ I haven't tried it and can't say anything about its quality. If you try it and it works, please give feedback. — schmendrich, Feb 01 '13 at 07:40
MS Word 2013 allows editing PDF files. I have not tried it and don't know how well it works. — matth, Feb 01 '13 at 10:08
This comes down to the 'workflow for Word' issue, really. I've duped: if I'm wrong, ping me and I'll reopen. — Joseph Wright, Aug 06 '13 at 06:02

score 0 · Answer 1 · answered Jul 31 '13 at 12:30

For the text of a paper here, I ended up writing a Word (2003) macro to read in the .tex file. It can handle text-mode sub- and superscripts and a few of my own simple macros, and it applies some rudimentary formatting to \section etc. However, math mode is completely out, as are tables. Images are imported automatically if they exist as .png (use ImageMagick first).

The use case here is collaborating on the text of a paper, allowing the use of Word's comment and track changes tools.

It's more in the spirit of an MWE than a polished tool, but you're welcome to play with it. It needs a little work before I can upload it (detaching it from some personal stuff), so let me know if anyone wants a look.
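To give a rough idea of the kind of conversion described above, here is a minimal sketch in Python. It is not the Word macro itself, just a stand-in that turns text-mode \textsubscript and \textsuperscript and plain \section headings into an HTML fragment that Word can open. The function name tex_text_to_html and the choice of HTML as the target are assumptions for illustration; math mode, tables, nested braces and mhchem's \ce are deliberately not handled.

```python
import html
import re

# Hypothetical mapping from text-mode LaTeX macros to HTML tags (an assumption
# for this sketch, not taken from the Word macro described in the answer).
SCRIPT_TAGS = {"textsubscript": "sub", "textsuperscript": "sup"}


def tex_text_to_html(tex: str) -> str:
    """Convert a very small subset of text-mode LaTeX to an HTML fragment."""
    # Escape &, < and > so that literal characters survive in the HTML.
    out = html.escape(tex, quote=False)
    # \textsubscript{2} -> <sub>2</sub>, \textsuperscript{2+} -> <sup>2+</sup>
    for macro, tag in SCRIPT_TAGS.items():
        pattern = r"\\%s\{([^{}]*)\}" % macro
        out = re.sub(pattern, r"<%s>\1</%s>" % (tag, tag), out)
    # \section{Title} (or \section*{Title}) -> <h1>Title</h1>
    out = re.sub(r"\\section\*?\{([^{}]*)\}", r"<h1>\1</h1>", out)
    return out


sample = "\\section{Results}\nH\\textsubscript{2}O and Ca\\textsuperscript{2+} at T < 300 K"
print(tex_text_to_html(sample))
```

Sub- and superscripts are replaced before headings so that a heading containing a formula still converts. Anything the sketch does not recognise is passed through unchanged, so leftover commands are easy to spot in the output.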

I've just seen this was an old thread, must have got poked by community, so probably of no interest to @Phillip, but the offer stands. — Chris H, Jul 31 '13 at 12:34

score 0 · Answer 2 · answered Feb 01 '13 at 09:41

There are many options, but the routes involving HTML/XML are nicer than those involving RTF.

Word 2007 and later allow you to save .docx files as HTML files using CSS (as Word 2003 allowed you to save .doc files as really nasty HTML). The output isn't pretty, but it is possible to use Pandoc to convert this HTML into Markdown.

I've done this, but it was ages ago and I don't recall what the complications were - when I have more time I'll try this out.
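The HTML-to-Markdown step could look roughly like the sketch below, which wraps the Pandoc command line in a small Python helper. The file names thesis.html and thesis.md and both function names are hypothetical; only the pandoc options -f, -t and -o are relied on. The conversion is attempted only if pandoc is on the PATH and the input file exists, and no claim is made about how clean the result is for Word-generated HTML.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def pandoc_html_to_markdown_cmd(html_file: str, md_file: str) -> List[str]:
    """Build the pandoc command that converts an HTML file to Markdown."""
    return ["pandoc", "-f", "html", "-t", "markdown", "-o", md_file, html_file]


def convert_html_to_markdown(html_file: str, md_file: str) -> bool:
    """Run pandoc if it is installed and the input exists; report success."""
    if shutil.which("pandoc") is None or not Path(html_file).is_file():
        return False
    subprocess.run(pandoc_html_to_markdown_cmd(html_file, md_file), check=True)
    return True


# Hypothetical file names: thesis.html would be the file saved from Word.
print(" ".join(pandoc_html_to_markdown_cmd("thesis.html", "thesis.md")))
```

Calling convert_html_to_markdown("thesis.html", "thesis.md") returns False instead of raising an error when pandoc or the input file is missing, which keeps the helper safe to try on a machine without Pandoc.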
