
I have a document that contains formatting macros. It includes a second document that contains the content.

I'd like to create a second document that redefines the formatting macros so that the resulting PDF converts cleanly to plain text. I'm using PDFMiner's pdf2txt.py (Python), but the document itself needs to be suited for conversion (no columns, drawing marks, typeface changes, etc.).

The only problem I'm having is with line wrapping. Is there a way to turn off wrapping for (every) paragraph? I think this would solve my problem.

If there's another way to convert a LaTeX document to text (without me having to code the tool to do it), I'm willing to consider alternatives to my current plan.

  • Try pandoc. If your document is not overly complicated, it is perhaps better to write in markdown and use pandoc to convert to LaTeX rather than the other way around (though pandoc does work in the direction you're talking about, too). Also line wrapping sounds like an editor 'problem', not a *TeX problem..? – jon Jul 25 '13 at 05:05
  • What do you mean exactly by "line wrapping"? Hyphenation? – Werner Jul 25 '13 at 05:40
  • do you actually want to stop line breaking and just have one very long line? or do you just want to stop hyphenation breaking words and adding - in the latter case just add \raggedright – David Carlisle Jul 25 '13 at 07:06
  • @jon, I don't think the document would count as simple. It's full of custom macros to do formatting/layout. I wish to render it for visual beauty/usefulness and for "export to text" without duplicating the text content. I just want to redefine the formatting macros for the export-to-text version. – Jubjub Bandersnatch Jul 25 '13 at 12:44
  • @Werner, LaTeX's default mode is to wrap lines, which is very sensible. The pdf-to-text conversion tool I'm using preserves those broken lines. For the sake of the exported text, I want all the words in one paragraph to be on a single line. – Jubjub Bandersnatch Jul 25 '13 at 12:45
  • PDFtoText is converting at the wrong time, obviously: you need to convert from the actual .tex file, which is what things like pandoc do (also see the program detex). However, you could use tex4ht and neutralize your custom macros with boolean switches: \newif\ifplaintext ... \ifplaintext <neutralize custom macros> \else <your current macros> \fi; then issue \plaintexttrue (and render with tex4ht) or \plaintextfalse as needed. If you want more advanced control, check out the package etoolbox. – jon Jul 25 '13 at 15:21
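
The boolean-switch idea from the last comment could look roughly like the sketch below. It is only an illustration: \term is a hypothetical stand-in for the custom formatting macros, and the \raggedright line follows the earlier suggestion for avoiding hyphenated breaks. It does not by itself put each paragraph on a single line.

```latex
\documentclass{article}

% Switch between the two builds. \ifplaintext is the name used in the
% comment above; flip to \plaintexttrue for the export-to-text version.
\newif\ifplaintext
\plaintextfalse

\ifplaintext
  % Plain-text build: the (hypothetical) formatting macro just passes
  % its argument through, with no typeface change.
  \newcommand{\term}[1]{#1}
  % Ragged-right setting, as suggested above, to avoid hyphenated breaks.
  \raggedright
\else
  % Normal build: the (hypothetical) macro applies visual formatting.
  \newcommand{\term}[1]{\textbf{#1}}
\fi

\begin{document}
This paragraph uses a \term{custom macro} whose meaning depends on the switch.
\end{document}
```

With this layout the content file stays untouched; only the line that sets the switch differs between the two driver documents.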

0 Answers