Using Python and some Linux tools, I'm able to examine the lowest-level elements of a PDF. For example, a snippet from one page looks like this:
/TT1 1 Tf -0.0001 Tc 0.896 0 Td (Table)Tj /TT2 1 Tf 0 Tc (@)Tj /TT1 1 Tf -0.0001 Tc (RandomSample)Tj /TT2 1 Tf 0 Tc 10.8 0 Td (@)Tj /TT1 1 Tf -0.0001 Tc (Range)Tj /TT2 1 Tf 0 Tc (@)Tj /TT1 1 Tf -0.0001 Tc (39)Tj /TT2 1 Tf 0 Tc (D)Tj /TT1 1 Tf 0.3564 Tc (,7)Tj /TT2 1 Tf 0.2956 Tc 7.557 0 Td [(D\352)296(\352)]TJ /TT1 1 Tf -0.0001 Tc (Sort,)Tj /TT2 1 Tf 0 Tc 5.748 0 Td (8)Tj /TT1 1 Tf -0.0001 Tc [(10)-122(^)-122(6)]TJ /TT2 1 Tf 0 Tc (<D)Tj /TT1 1 Tf (;)Tj /TT0 1 Tf -0.0004 Tc 0.0004 Tw 8 0 0 8 87.4762 451.6921 Tm (
Can I access that level of detail using Mathematica?
I mean accessing this content page by page, not reading the whole PDF as a binary file.
Update: The text is a snippet of the raw text of a single page of a PDF about Mathematica. It was generated using the following Linux utility:
$ dumppdf -t -p 584 filename.pdf
The syntax of the raw text is described in section 5.3.2, "Text-Showing Operators", of the PDF Reference, version 1.7. I extracted that part of the raw page text because it is what produces a single Mathematica command on the page:
Plus @@ # & /@ {{1, 2, 3}, {2, 3, 4, 5}}
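To make the operator syntax concrete, here is a minimal Python sketch that walks an excerpt of the snippet above and pairs each shown string with the font selected by the preceding Tf operator. The names `shown_text`, `TOKEN` and `decode_octal` are my own, not part of any library. It assumes the strings contain no nested or escaped parentheses, and that an octal escape such as \352 can be read as a single Latin-1 character; a real content-stream parser has to handle far more than this.

```python
import re

# Contiguous excerpt of the content stream quoted above (raw string keeps \352 literal).
SNIPPET = (
    r"/TT1 1 Tf 0.3564 Tc (,7)Tj /TT2 1 Tf 0.2956 Tc 7.557 0 Td "
    r"[(D\352)296(\352)]TJ /TT1 1 Tf -0.0001 Tc (Sort,)Tj"
)

# Three constructs are recognised; everything else (Tc, Td, ...) is skipped:
#   /Name size Tf      select a font
#   (string) Tj        show one string
#   [(s1) n (s2)] TJ   show several strings with kerning adjustments
TOKEN = re.compile(
    r"/(?P<font>\w+)\s+[-\d.]+\s+Tf"
    r"|\((?P<tj>[^()]*)\)\s*Tj"
    r"|\[(?P<tjarray>[^\]]*)\]\s*TJ"
)
STRING_IN_ARRAY = re.compile(r"\(([^()]*)\)")
OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def decode_octal(s):
    """Replace \\ddd octal escapes by the character with that code (assumption: Latin-1)."""
    return OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), s)


def shown_text(stream):
    """Return (font, text) pairs in the order the strings are shown."""
    font = None
    runs = []
    for m in TOKEN.finditer(stream):
        if m.group("font") is not None:
            font = m.group("font")
        elif m.group("tj") is not None:
            runs.append((font, decode_octal(m.group("tj"))))
        else:
            # Keep only the string parts of the TJ array; drop the kerning numbers.
            parts = STRING_IN_ARRAY.findall(m.group("tjarray"))
            runs.append((font, "".join(decode_octal(p) for p in parts)))
    return runs


for font, text in shown_text(SNIPPET):
    print(font, ascii(text))
```

Each run keeps its font name, which is exactly the information that is lost when the text is copied out of a PDF viewer.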
The fonts /TT1 and /TT2 in the snippet are 'Courier-Bold' and 'Mathematica2Mono-Bold', respectively. If I copy the above command from the PDF and paste it into a Mathematica notebook, it is rendered as
Plus üü Ò & êü 881, 2, 3<, 82, 3, 4, 5<<
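The two strings above line up character by character, so the substitutions can be read off mechanically. The following Python sketch does that; `substitution_table` is a name I made up for illustration, and the table it produces is only what these two strings show, not a complete encoding of the font.

```python
import unicodedata

intended = "Plus @@ # & /@ {{1, 2, 3}, {2, 3, 4, 5}}"
pasted = "Plus üü Ò & êü 881, 2, 3<, 82, 3, 4, 5<<"


def substitution_table(pasted, intended):
    """Map each pasted character to the intended one wherever the two differ."""
    # Normalise so that accented letters are single code points.
    pasted = unicodedata.normalize("NFC", pasted)
    intended = unicodedata.normalize("NFC", intended)
    if len(pasted) != len(intended):
        raise ValueError("strings must align character by character")
    table = {}
    for p, i in zip(pasted, intended):
        if p != i and table.setdefault(p, i) != i:
            raise ValueError(f"inconsistent mapping for {p!r}")
    return table


table = substitution_table(pasted, intended)
for garbled, real in table.items():
    print(ascii(garbled), "->", real)
```

The result maps ü to @, Ò to #, ê to /, 8 to { and < to }. The table cannot simply be applied to pasted text, because 8 and < are also ordinary characters in the Courier font; deciding which is which requires knowing the font of each run. Note also that ê is character code 234, octal 352, which matches the \352 escapes in the raw snippet.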
I was hoping to avoid dumppdf and extract the raw page text using only Mathematica commands.
StringReplace[Shortest[StringExpression["stream",x__, "endstream"]]:>StringJoin["stream\n",BaseEncode@StringToByteArray[x], "\nendstream"]]@ExportString["Hello", "PDF"]. – rhermans Feb 10 '23 at 12:54