
When I open a PDF file compiled from LaTeX with a text editor (e.g. Notepad++), the content of the file is not human-readable and looks like the sample below, so it seems to me that potential crawlers cannot process the information.

n×å9.â^Ñäùàɨ•” HTÏ•ì#ò Ž–}q”mäŠ9ÒrbtRšá™g—û}Açú¦nƒÖ…‡­”jKœˆ FàÆµÀmþåá•N:º‚~éWF¶DX‹m#‚D˜Àm;Ñum?OŠÀÊ¢ßÎ[ÈuóõÄ÷;Ý6"-@pñäÙ(ÖXÜÕËaœyýûdRìørêÑbβ(\n^Øþ2Öƒ;¬÷ª»¦Òv0þ®±úßY'°³½‹%…ߥºíúŸKåÒì¶¶\êæñÕ_–áúª –ò1üj9¶,Ö×VæY¼wæ¬Döð}]

Is it possible to generate the PDF file so that when the document displays text such as "Specific detail 1", I can also find the string "Specific detail 1" in the raw content of the file when I open it with a text editor?

This is useful, for example, when a PDF resume is created in LaTeX and must be parsed automatically by various text analyzers.

  • Welcome to TeX.SX! What you see is compressed data; it still contains the textual output (pdftotext etc. are able to process it). Which crawler are you referring to that does not support compression? – TeXnician Aug 29 '18 at 09:22
  • I didn't know that crawlers support reading compressed PDFs. I assumed that if it's not human readable then it's also not crawler readable. – Alexandru Irimiea Aug 29 '18 at 22:12

2 Answers


For pdfTeX

\pdfcompresslevel = 0 %
\pdfobjcompresslevel = 0 %
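
As a minimal sketch, assuming the document is compiled with pdflatex, the two settings could be placed in the preamble as shown here. The article class and the body text are placeholders only, not part of the original answer.

```latex
% Minimal sketch for pdflatex (assumption: pdfTeX engine in PDF mode).
\documentclass{article}

% Turn off stream compression and object-stream compression
% before any PDF output is produced.
\pdfcompresslevel = 0 %
\pdfobjcompresslevel = 0 %

\begin{document}
Specific detail 1
\end{document}
```

Even with compression disabled, the text in the page stream may still be broken into fragments by kerning adjustments, so a literal search for the full string is not guaranteed to succeed.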

For LuaTeX

\pdfvariable compresslevel = 0 %
\pdfvariable objcompresslevel = 0 %

For use with (x)dvipdfmx (XeTeX, upTeX, etc.)

\special{dvipdfmx:config z 0}
\special{dvipdfmx:config C 0x40}

For ps2pdf routes

\special{/setdistillerparams where
    {pop<</CompressPages false>>setdistillerparams}
  if
}
\special{/setdistillerparams where
    {pop<</CompressStreams false>>setdistillerparams}
  if
}
Joseph Wright
  • Probably in the near future I'll add an interface for this to expl3. – Joseph Wright Aug 29 '18 at 09:25
  • +1, but I wonder whether a crawler that doesn't support PDF compression likes this: BT /F8 9.9626 Tf 148.712 707.125 Td [(My)-333(sup)-28(er)-333(imp)-28(ortan)28(t)-334(text.)]TJ 154.421 -567.87 Td [(1)]TJ ET – TeXnician Aug 29 '18 at 09:27
  • @TeXnician Sure, we can't do that much about that! – Joseph Wright Aug 29 '18 at 09:28
  • @AlexG Sure, but the point is that if a crawler can't understand PDF compression and only reads 'text' in a PDF, it probably can't follow the kerning and so on either – Joseph Wright Aug 29 '18 at 10:14

Since l3kernel 2021-02-18 or LaTeX2e 2021-06-01, there's an all-in-one LaTeX3 function \pdf_uncompress:.

Quoting its documentation in texdoc interface3 (2023-02-22), sec. 36.4 Compression:

\pdf_uncompress:
New: 2021-02-10

Disables any compression of the PDF, where possible. This function may only be used up to the point where the PDF file is initialised.

Prior to l3kernel 2021-02-18 and since l3pdf 2019-07-01, \pdf_uncompress: is available if the l3pdf package is loaded. l3pdf was moved into l3kernel on 2021-02-18.
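
A minimal usage sketch, assuming LaTeX2e 2021-06-01 or later so that expl3 syntax is available without loading extra packages. The article class and the body text are placeholders. The call sits in the preamble because the documentation quoted above says the function may only be used before the PDF file is initialised.

```latex
% Minimal sketch (assumption: LaTeX2e 2021-06-01 or newer).
\documentclass{article}

\ExplSyntaxOn
% Ask the backend to disable PDF compression where possible.
\pdf_uncompress:
\ExplSyntaxOff

\begin{document}
Specific detail 1
\end{document}
```

On older installations that fall within the l3pdf date range mentioned above, the same call should work after adding \usepackage{l3pdf} to the preamble.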

muzimuzhi Z