Using Python and some Linux tools, I'm able to examine the lowest-level elements of a PDF. For example, a snippet from one page looks like this:

/TT1 1 Tf -0.0001 Tc 0.896 0 Td (Table)Tj /TT2 1 Tf 0 Tc (@)Tj /TT1 1 Tf -0.0001 Tc (RandomSample)Tj /TT2 1 Tf 0 Tc 10.8 0 Td (@)Tj /TT1 1 Tf -0.0001 Tc (Range)Tj /TT2 1 Tf 0 Tc (@)Tj /TT1 1 Tf -0.0001 Tc (39)Tj /TT2 1 Tf 0 Tc (D)Tj /TT1 1 Tf 0.3564 Tc (,7)Tj /TT2 1 Tf 0.2956 Tc 7.557 0 Td [(D\352)296(\352)]TJ /TT1 1 Tf -0.0001 Tc (Sort,)Tj /TT2 1 Tf 0 Tc 5.748 0 Td (8)Tj /TT1 1 Tf -0.0001 Tc [(10)-122(^)-122(6)]TJ /TT2 1 Tf 0 Tc (<D)Tj /TT1 1 Tf (;)Tj /TT0 1 Tf -0.0004 Tc 0.0004 Tw 8 0 0 8 87.4762 451.6921 Tm (

Can I access that level of detail using Mathematica?

I'm talking about accessing this on a page-by-page basis, not as a binary file.

Update: The text is a snippet of the raw text of a single page of a PDF about Mathematica. It was generated using the following Linux utility:

$ dumppdf -t -p 584 filename.pdf

The syntax in the raw text is described in section 5.3.2 Text-showing Operators of the PDF Reference v1.7. I extracted that part of the raw page text because it is needed to generate a single Mathematica command:

Plus @@ # & /@ {{1, 2, 3}, {2, 3, 4, 5}}

The fonts /TT1 and /TT2 you can see in the snippet are 'Courier-Bold' and 'Mathematica2Mono-Bold'. If I attempt to copy the above MMA command from the PDF and paste it into an MMA Notebook, it is rendered as

Plus üü Ò & êü 881, 2, 3<, 82, 3, 4, 5<<

I was hoping to avoid using dumppdf and extract raw page text using only MMA commands.
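
A minimal Python sketch of the font-tracking idea, using the tail of the snippet above as input. It follows the `Tf` (select font), `Tj` (show string) and `TJ` (show array) operators and translates only the strings shown in /TT2. The glyph table is an assumption inferred solely from the two examples in this post (for instance `@` and `D` standing for `[` and `]`); it is not the font's documented encoding, and the regular expressions ignore nested parentheses and other parts of the PDF syntax.

```python
import re

# ASSUMPTION: glyph mapping for the Mathematica font (/TT2), inferred only
# from the examples in this post, not from the font's real encoding table.
MMA_GLYPHS = {
    "@": "[", "D": "]", "8": "{", "<": "}",
    "\xfc": "@", "\xd2": "#", "\xea": "/",
}

STRING = r"\((?:\\.|[^\\()])*\)"  # PDF literal string, backslash escapes allowed
TOKEN = re.compile(
    r"/(?P<font>\w+)\s+[-\d.]+\s+Tf"                # /TT1 1 Tf   select font
    rf"|(?P<text>{STRING})\s*Tj"                     # (Table)Tj   show one string
    rf"|\[(?P<array>(?:{STRING}|[^\]()])*)\]\s*TJ"   # [(10)-122(^)]TJ  show array
)
STRING_RE = re.compile(STRING)
ESCAPE_RE = re.compile(r"\\([0-7]{1,3}|.)")


def unescape(body):
    """Decode backslash escapes such as \\352 (octal) inside a PDF string."""
    def repl(match):
        esc = match.group(1)
        if esc[0] in "01234567":
            return chr(int(esc, 8) & 0xFF)
        return {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f"}.get(esc, esc)
    return ESCAPE_RE.sub(repl, body)


def font_runs(stream):
    """Return (font_name, text) pairs in the order they are shown."""
    font, runs = None, []
    for match in TOKEN.finditer(stream):
        if match.group("font"):
            font = match.group("font")
        elif match.group("text") is not None:
            runs.append((font, unescape(match.group("text")[1:-1])))
        else:
            # Keep the strings of a TJ array; the numbers are kerning offsets.
            for literal in STRING_RE.findall(match.group("array")):
                runs.append((font, unescape(literal[1:-1])))
    return runs


def to_mathematica(stream, mma_font="TT2"):
    """Join all runs, translating only characters shown in the Mathematica font."""
    parts = []
    for font, text in font_runs(stream):
        if font == mma_font:
            text = "".join(MMA_GLYPHS.get(ch, ch) for ch in text)
        parts.append(text)
    return "".join(parts)


# Tail of the snippet from the question, copied verbatim.
SAMPLE = (
    r"/TT1 1 Tf -0.0001 Tc (39)Tj /TT2 1 Tf 0 Tc (D)Tj "
    r"/TT1 1 Tf 0.3564 Tc (,7)Tj /TT2 1 Tf 0.2956 Tc 7.557 0 Td "
    r"[(D\352)296(\352)]TJ /TT1 1 Tf -0.0001 Tc (Sort,)Tj "
    r"/TT2 1 Tf 0 Tc 5.748 0 Td (8)Tj /TT1 1 Tf -0.0001 Tc "
    r"[(10)-122(^)-122(6)]TJ /TT2 1 Tf 0 Tc (<D)Tj /TT1 1 Tf (;)Tj"
)

print(to_mathematica(SAMPLE))  # 39],7]//Sort,{10^6}];
```

Tracking the font matters because the pasted text alone is ambiguous: an `8` shown in /TT1 is the digit, while an `8` shown in /TT2 stands for `{`.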

Jim Marks
  • Yes, it is possible to read a PDF at the lowest level. That necessarily means working with the "binary" file, but a large portion of the data is just plain text, so binary should not scare you. I have read and manipulated PDF files at the lowest level in this answer, where I read and update the metadata of the PDF. Does that help? If not, what is missing? BTW, it's complicated: you will need to understand how the PDF file format works if you want to do anything but the simplest substitutions. Just inserting text will mess up the file. – rhermans Feb 10 '23 at 11:03
  • Please [edit] your question to provide a minimal working example, using a PDF generated synthetically in Mathematica, and a detailed explanation of the modification you want to make. Also explain why you want to do this. Given the complexity of the problem, alternative approaches will very likely be more convenient. – rhermans Feb 10 '23 at 11:06
  • As an example, try this : StringReplace[Shortest[StringExpression["stream",x__, "endstream"]]:>StringJoin["stream\n",BaseEncode@StringToByteArray[x], "\nendstream"]]@ExportString["Hello", "PDF"]. – rhermans Feb 10 '23 at 12:54
  • @rhermans I think my original post wasn't clear about what I need to do. I have an online PDF book about MMA, and I want to copy an MMA command from it and paste it into the front end. The PDF snippet in my question is the PDF syntax that renders that command in a viewer. I've determined that the problem is caused by the PDF "code" switching fonts multiple times in order to render the command: letters and numbers use a standard font, and whenever an MMA character is needed, the PDF switches to an MMA font. I want to parse the snippet and write some code that handles the font switches as needed so that the result is valid in the front end. – Jim Marks Feb 13 '23 at 00:44
  • You could probably [edit] your question to explain in detail how that code is obtained and what the format is supposed to be. What you explain in the comments doesn't make much sense to me. As far as I can tell, that is not how the PDF format works, and it's unlikely (though not impossible) that you would get valid code out of a PDF. – rhermans Feb 13 '23 at 09:11
  • @rhermans I have edited my original question to include some more detail (which is what I should have done in the first place :-) – Jim Marks Feb 13 '23 at 15:06

0 Answers