LaTeX PDF to Word using Acrobat Pro: How to prevent multiple white-space?

Question

To make use of the grammar and spell-checking of Word, I'd like to use Adobe Acrobat Pro to export a PDF into a Word document.

When doing so, the resulting Word document contains runs of multiple spaces, which degrades the spell checking.

I'd like an alternative to repeatedly running string-replace operations (i.e., replacing two spaces with one).

Is there any package or option that can help to prevent these multiple spaces?

Welcome to TeX.SX! The problem is not produced by LaTeX but by the PDF viewer in question, which interprets the gap between words as two spaces. You should perhaps use a (La)TeX IDE that supports grammar and spell checking, for example TeXstudio or Texmaker. — Skillmon, Oct 27 '17 at 11:14
See also https://tex.stackexchange.com/questions/15/spell-checking-latex-documents. — Marijn, Oct 27 '17 at 11:17
When I want to use the PDF for spell checking, I normally add `\geometry{paperwidth=300cm}` to the document and sometimes also `\raggedright`. Then almost every paragraph fits on one line and the word spacing is quite uniform. (I don't export to Word; I copy and paste instead.) — Ulrike Fischer, Oct 27 '17 at 11:25
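
A rough sketch of the workaround from the last comment, assuming the geometry package is what provides `\geometry` (the comment does not show a full preamble) and that the change is only for a throwaway proofreading build. The document class and sample text are placeholders:

```latex
\documentclass{article}
\usepackage{geometry}
\geometry{paperwidth=300cm} % very wide page, so most paragraphs fit on one line

\begin{document}
\raggedright % optional: no justification, so interword spaces are not stretched
A paragraph that would normally wrap over several lines now stays on a single
line, which should keep the word spacing uniform for copy and paste.
\end{document}
```

Remove both settings again before producing the final PDF, since they change the layout completely.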

score 1 · Answer 1 · answered Oct 27 '17 at 14:30

Unfortunately, I do not believe there is a single catch-all method.

Are you compiling with pdfTeX, or with LuaTeX or XeTeX? Currently, pdfTeX allows you to use the command \pdfinterwordspaceon (this might depend on your version of TeX). This places actual space characters in the PDF. Without it, or when using other engines that don't have the command, the PDF does not actually contain a space character between words: separation is by physical placement only. This may seem odd, but it is perfectly acceptable in non-archival PDF (that is, except for PDF/A).
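
A minimal sketch of where the command could go, assuming a pdfLaTeX run. The `\ifdefined` guard and the sample sentence are additions for illustration, and whether the command has any visible effect may depend on the pdfTeX version and font setup:

```latex
\documentclass{article}

% Only pdfTeX provides \pdfinterwordspaceon; the guard (added here for
% illustration) skips it under engines that do not define the primitive.
\ifdefined\pdfinterwordspaceon
  \pdfinterwordspaceon % ask pdfTeX to write real space characters between words
\fi

\begin{document}
A short test sentence for checking the text extracted from the PDF.
\end{document}
```

Copying the sentence out of the resulting PDF, once with and once without the command, is a quick way to see whether it changes what the viewer extracts.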

When you open a PDF in Acrobat Pro and "add tags" to it, the PDF is examined by an algorithm that looks for white space above a certain limit and substitutes a space character there. This is how the exported text (or Word document) gets its spaces. Ordinary Adobe Reader does not add tags, so text extracted from a LaTeX PDF will often be jumbled. If you don't have Acrobat Pro, the free Okular PDF reader (Linux and Windows) has enough smarts to recognize gaps as spaces.

Now to answer the original question: Since Acrobat Pro is not actually reading the space character, but is guessing where spaces should be based on visual appearance, it may insert too many or too few spaces. There's not much you can do about that. At best, when you export the document to Word, you can use Word's own search/replace feature to look for consecutive spaces, and replace them with a single space.
