I have multiline text in an \item of an {itemize} environment or in a {section}, and I am creating a PDF of the document. When I copy the text from the PDF, there is a newline character at the end of each line. The document will be read by an AI tool, and sentences split across lines will confuse it.

How can I configure LaTeX so that it does not add those unnecessary newlines?

For example, if I copy the text from the PDF generated from the following code, there is a newline character between "character in" and "the middle":

\begin{itemize}
    \item This is some text that spans multiple lines. I need the pdf to not have a newline 
    character in the middle of the sentence in the copied text
    \item Some more text.
\end{itemize}


Furqan136

1 Answer

Well, this is one of the things the Tagged PDF project is about. If you compile the following in a current TeX system with lualatex (which handles real space characters best):

\DocumentMetadata{testphase=phase-III}
\documentclass{article}

\begin{document}
\begin{itemize}
    \item This is some text that spans multiple lines. I need the pdf to not have a newline
    character in the middle of the sentence in the copied text
    \item Some more text.
\end{itemize}
\end{document}

then copy & paste will give

•
This is some text that spans multiple lines. I need the pdf to not have a newline character in the middle of the sentence in the copied text
•
Some more text.

But generally you shouldn't put too much trust in copy and paste from a PDF. The format doesn't store simple text, which means that every reader has to apply some heuristics to reconstruct it.
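
For reference, a slightly fuller variant of the example above might look like the sketch below. The lang and pdfversion keys are additions for illustration only and are not needed for the tagging itself. The sketch assumes a recent TeX system in which \DocumentMetadata accepts them, and it is compiled with lualatex as before.

% Sketch only: lang and pdfversion are assumed extras, not part of the original example
\DocumentMetadata{
  lang = en-US,          % assumed: document language recorded in the PDF
  pdfversion = 2.0,      % assumed: target PDF 2.0
  testphase = phase-III  % loads the tagging code, as in the example above
}
\documentclass{article}

\begin{document}
\begin{itemize}
    \item This is some text that spans multiple lines. I need the pdf to not have a newline
    character in the middle of the sentence in the copied text
    \item Some more text.
\end{itemize}
\end{document}

Whether a particular viewer then pastes each item as a single line still depends on how that viewer makes use of the tags.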

Ulrike Fischer
  • @campa I generally choose one at random from to|too|two – Ulrike Fischer Dec 22 '23 at 15:10
  • Thanks for this reference. I understand that you are working on this project and that it is not yet production-ready. But is the current implementation stable for simple text and bullet points? – Furqan136 Dec 23 '23 at 19:50