PDF with bizarre source code - can anything be retrieved from it?

Question

My boyfriend's brother committed suicide last weekend, and left behind a will... in the form of a damaged PDF. I've tried to repair it using the various online services, to no avail. Looking at the source code of the PDF, it's clear something very strange is going on. There are huge blocks of "ÿ" character repeating, and then bits of HTML, including a partial "mergeOptions()" JavaScript function, which looks like something one uses for autocomplete. How could this happen? The PDF ought to contain only a scan of the handwritten will. I understand PDF files can become corrupted during transfer from one device to another, but how could an HTML page have gotten mixed up into the source code? Also, it looks like a page which my boyfriend's brother might have consulted, as it's related to his job, but not something he would have included in his will.

It seems like repairing the file is a lost cause (I have tried Foxit, Xpdf, Ghostscript, and some other tools), but is there no way to recover anything from it? In between the gibberish are blocks of code which resemble PDF code. But even when I remove the gibberish and add PDF headers to the source code, the PDF appears as a blank page.

Also, is there any explanation for what could have happened to this file? It's just baffling. The file was on a hard drive, so at first we thought it could have been corrupted during transfer from the laptop to the hard drive. But we've now recovered the file from the laptop as well and it is exactly the same.

Any help/advice would be greatly appreciated, thanks!!

Editing to add a sample of what the file contains:

ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿelse {
                        return false;
                    }
                };
        }
        if (options.categories !== undefined && options.categories instanceof Array) {

            var categories = [];

ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿU¢¢JGƒ€èZ    m[Rïœ·}h‹—Ø$îõø»Ýïuâ;>ïØÂÍ šŒé¿‡÷£6ëÅÕ~            4;²Õ •j,Mîîî¼i_Uš            Ö÷¿“È#æU¼A·ñ€ŠïØi™[/±P‡¸cßé]_{×Ìµæ %»Ù¹²” C>{üŸ6¹0CWá½€!ÿÿÿò—Š‹Dâ€ƒä>•wœn\ L’ê+ÙDD{99DqBúž²">H¾»lÌ/ÏéÝw×9\UÅ?Øý ÍÕ[nò•b«Ú6'±ºþbÛ6ÞA±| Z0% ðô,à¤MûèÓ86/™¸Ù$\É|n‘8çcZ
áá Ó^½9šæ
4œA8®Âö¸.ÙÒ;lk8œ»
vú~þÿ{¾Äéôø¿ŽþÏZÑüv‹ù1Þý;uõüWÑxVÎßó(ŸÎ÷åµméÖÎŸ hN½Ç>‚Îf“™¯…7Ü?÷ÿÓÿp>çþŸoíûÿÿÿà ¼è€í:!”õÄƒb@Ùhr@(¥áíËÌó@Æm•¨®6®ÛÔØ[³¹‰(×²»®äM1ÈÖßþ÷—Ô,¬7…^½T &amp;Ðh?JáG¢=ìÒk8ƒìY I¡Ì¹°è°»?*s´ëœüTYÒ‹Aß'arO(ÌëÁ˜/HÉ;ˆ³çI80}ÎÐx6Ãç~üeÿ»'Qïµ]ø'«ü÷ï±º '.îÞ¯õ ÿ +ùû¹ééÏÑ÷ß }wû¾BëxºŠþêúú.ûÕÞ¬ïÿyêã¼ïžªï½Wî^¬;ÏWÕûÿWï»ÓÕÃîùê¾¯ßw ð&gt;qÀ UAš 3Æa)B®¾¿ÿóÙ&amp;ÅžO_ÿõ/þ&amp;‹ßd$ÙZÐŒ}Õ§Õj÷×¨)Ö’š„Îô«Y. ®¼²ÞYR¿o–aDj.k1g¬üÛÆ‚o¿‘5¡OúÅú÷àœµN¤ñ8¿šÓÒ”^7ùÄÔöf£_˜bJëUŸ¬¥nÚþN›u¥†ÿ—+ÿÉB_©¼¨& ùÝõ“ä‚NïÛòAml·¿V‰PÂ„/²Fm?Ýô½V/QŸw{ösì×ÙÕú4Óo‹ts„Ì,ÜŠ5õ?Ó4ë°ˆz™:Åº~ËÝ¡='–«ðÇAFì}{æ¯âü®²ü°ß{_a;ß»òÍwß—¶è¿áF †6ÒéAÔÔÛ¦ZZºº©HWY;¯`LA/Z%¼ƒEFhzm˜Bôû÷7‹ÿA#Yá±2ÕË5K-uJÇnõ'Tõß(´$T:{á³Ü¾6fI¤Ó¾—©$Ïýœÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

The JavaScript is just there that one time. After that there are alternating blocks of the ÿ character and the kind of code shown above. The file starts and ends with the ÿ blocks. I shortened them in this snippet - they're actually much longer.

I don't know if this is the right way to share the file but here it is:

https://1drv.ms/b/s!AnGtFF6JZrtsgWSEYq_UiU7ib0rQ?e=xJpgFg

Thank you guys so much for all your help already, I'm honestly so moved. A violent death is a horrible thing.

Impossible to say without seeing (part of) that file. Files do get corrupted, but it's rare in my experience. Maybe it's not a pdf file at all? You could check that with the file utility (see here ) — Berend, May 30 '23 at 08:01
If version 10.01.1 of GS won't open the file at all then it's pretty badly corrupted, but as Berend says, it isn't possible to be certain that the data is totally unrecoverable without seeing the PDF file. Nothing you've described seems to me to be completely incomprehensible in a PDF file (XML metadata, JavaScript actions, binary image data are all possible). Did you get anything at all from Ghostscript ? Is it possible the file is password protected and therefore encrypted ? Did you try Acrobat ? — KenS, May 30 '23 at 11:22
Hi, and sorry for my late response, I was foolishly checking my e-mail for comment notifications rather than directly here.
So first of all, I used TrID to check the file type, and it wasn't recognised as a PDF, which makes sense as the PDF does NOT start with %PDF-1 nor is there any %%EOF anywhere in the file. There is a "ÝÝ*£~PDF6Ók·×ä" close to the start of the file.

I got nothing from GhostScript. Acrobat says the file is damaged and can't be read.

I can share the file, not sure if this is the right way to do it:

https://1drv.ms/b/s!AnGtFF6JZrtsgWSEYq_UiU7ib0rQ?e=xJpgFg — pennylane, May 31 '23 at 06:58
Oh, just wanted to add, @KenS, someone else also suggested to me that the file might be password-protected. When I try to open it (in Acrobat or anywhere else), I don't get a message saying that it's password-protected, only that the file is damaged. — pennylane, May 31 '23 at 07:10
Hi @K J, I checked the link you mentioned (stackoverflow.com/a/76310165/10802527) and yes, my source code looks like that. Trying to figure out if I can use the advice given in that thread to get something out of it. Thanks!! — pennylane, May 31 '23 at 07:23
the file you uploaded hardly contains any useful data. I suggest you explain here reddit.com/r/datarecovery/. Do NOT write to the drive you got the file from. — Joep van Steen, May 31 '23 at 12:10

Joep van Steen · Answer 1 · 2023-05-31T12:45:09.577

IF there's a JPEG in there still encoded as JPEG you should be able to get it out like so:

Grab a copy of JpegSnoop
Drop the file on there
Ignore the tool's complaint
Tools > Image search fwd
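
For anyone who prefers a script to the GUI steps above, the same idea can be roughly approximated in Python. This is only a sketch and not how JpegSnoop itself works: it assumes any embedded JPEG is stored contiguously and unencrypted, and the function names and the `carved` output folder are made up for illustration.

```python
from pathlib import Path

SOI = b"\xff\xd8\xff"  # JPEG Start Of Image marker plus the first byte of the next marker
EOI = b"\xff\xd9"      # JPEG End Of Image marker


def find_jpeg_candidates(data: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of spans that begin with SOI and end with EOI."""
    spans = []
    start = data.find(SOI)
    while start != -1:
        end = data.find(EOI, start + len(SOI))
        if end == -1:
            break  # no closing marker after this SOI, so nothing more to carve
        spans.append((start, end + len(EOI)))
        start = data.find(SOI, start + 1)
    return spans


def carve_jpegs(path: str, out_dir: str = "carved") -> int:
    """Write each candidate span to out_dir (hypothetical name) and return the count."""
    data = Path(path).read_bytes()
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    spans = find_jpeg_candidates(data)
    for i, (start, end) in enumerate(spans):
        (out / f"candidate_{i}.jpg").write_bytes(data[start:end])
    return len(spans)
```

Each span stops at the first End Of Image marker after its start, so a JPEG with an embedded thumbnail may come out truncated. An empty result simply means no JPEG start marker exists anywhere in the file.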

EDIT

Now having had the opportunity to look at the file:

With an entropy of 0.99 the file can hardly contain any data; even a flat text file would be expected to show higher entropy. In general, the higher the entropy, the more data a file holds, while zero entropy would mean no data at all. This is a quick way to assess how much useful data a file can accommodate.
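
The entropy check can be reproduced with a short script. The sketch below assumes the figure is Shannon entropy measured in bits per byte (the answer does not state the scale or the tool used), and the function name is only illustrative.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for one repeated value, 8.0 for uniformly spread bytes."""
    if not data:
        return 0.0
    counts = Counter(data)  # how often each byte value occurs
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

On this scale a buffer made of a single repeated byte scores 0.0 and `bytes(range(256))` scores 8.0, so a file dominated by long runs of one byte value lands near the bottom of the range, while compressed image data would sit near the top.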

I work with, recover and repair JPEG data every day and I feel confident in stating your file contains no JPEG data at all, not even fragments.

In all likelihood, the file you currently have is as good as useless.

So rather than a repair case, this is a case where you can try to see if you can recover the file. I'd therefore:

Either consult with a data recovery specialist or find a community where such people hang out, suggestion: www.reddit.com/r/datarecovery/.
What you could do yourself is image the drive you got the file from: look into ddrescue.
Once you have the image file, put the drive aside and work with the image file exclusively.
Use file recovery software and determine if perhaps deleted versions of the document are recoverable.
If that yields no results, a tool like PhotoRec could try to recover each and every PDF file on the drive.
Since at some point a photo of a physical document was taken, you might also consider expanding the search towards JPEG files and see if the JPEG that was later embedded into a PDF file still exists somewhere.
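
To give an idea of what signature-based recovery of PDF files from a drive image involves, here is a minimal Python sketch. It is not how PhotoRec works internally; it assumes each PDF sits in one contiguous, unfragmented run inside the image, and the function names and the 50 MB size cap are arbitrary choices for illustration.

```python
import mmap

HEADER = b"%PDF-"   # every PDF starts with this
TRAILER = b"%%EOF"  # and normally ends with this (possibly several times)


def find_pdf_spans(buf, max_size: int = 50 * 1024 * 1024) -> list[tuple[int, int]]:
    """Return (start, end) offsets from each PDF header to the last %%EOF before the next header."""
    spans = []
    start = buf.find(HEADER)
    while start != -1:
        next_start = buf.find(HEADER, start + 1)
        limit = min(len(buf), start + max_size)
        if next_start != -1:
            limit = min(limit, next_start)
        eof = buf.rfind(TRAILER, start, limit)
        if eof != -1:
            spans.append((start, eof + len(TRAILER)))
        start = next_start
    return spans


def scan_image(image_path: str) -> list[tuple[int, int]]:
    """Memory-map a disk image read-only and list candidate PDF spans in it."""
    with open(image_path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return find_pdf_spans(mm)
```

Because the image is opened read-only and the original drive is never touched, this is safe to run repeatedly. Fragmented files will come out incomplete, which is where dedicated recovery tools are needed.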

Hi, and thanks, Joep van Steen, this is actually the kind of thing I was hoping for as I thought maybe I could recover the actual scanned image or bits of it... unfortunately, though, it "no SOI Marker found in forward search". :( — pennylane, May 31 '23 at 07:18
@KJ, you may be correct, I have no idea about the legal side of all this. I only described how I would fly this from a data recovery, forensic angle. — Joep van Steen, May 31 '23 at 13:21
@JoepvanSteen thank you so much for your advice. I think we've already recovered all data from the drive but will look into that. — pennylane, Jun 01 '23 at 07:27
@KJ we're in France, and here the rules are slightly different, but equally problematic in our case - a scanned will is only accepted if it isn't contested. If it's contested (which it's quite likely to be, we'd need the original paper copy... which was lost in the mail). — pennylane, Jun 01 '23 at 07:27

score 0 · Answer 2 · answered May 30 '23 at 08:22

First of all, sorry for your loss.
Having to face a data-recovery problem on top of that is really unfortunate.
And I know exactly how it feels. I had a similar situation with the broken hard drive in my uncle's laptop after his demise.

This feels like a case of hard-drive damage that happened AFTER the PDF was created and BEFORE you first tried to read the file.
Seems chkdsk or some other recovery tool "repaired" the problem but in doing so merged the PDF with other, random, unrelated data (browser cache content probably).
This is not unusual. No automated repair solution is 100% accurate. But it really sucks in this case.
The ÿ characters are another indicator of this. ÿ is Unicode U+00FF, which is how a text editor shows the byte 0xFF when it reads the file in a single-byte encoding such as Latin-1. Long runs of 0x00 or 0xFF bytes are often the default content of an empty disk block. If such blocks get linked into the file by chkdsk during the repair, this is what you typically get.
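
To see how much of the file is filler and where the remaining data sits, the runs can be mapped with a small script. This is a sketch under the assumption that 0x00 and 0xFF are the filler values; the function name and the 16-byte threshold are arbitrary choices for illustration.

```python
def data_islands(data: bytes, filler=(0x00, 0xFF), min_gap: int = 16) -> list[tuple[int, int]]:
    """Return (start, end) offsets of regions that are not long runs of filler bytes.

    Filler runs shorter than min_gap are treated as part of the surrounding data,
    because 0x00 and 0xFF also occur in ordinary binary content.
    """
    islands = []
    start = None      # start of the current island, if any
    last_data = None  # offset of the most recent non-filler byte
    for i, b in enumerate(data):
        if b in filler:
            continue
        if start is None:
            start = i
        elif i - last_data > min_gap:
            # a filler run of at least min_gap bytes ended the previous island
            islands.append((start, last_data + 1))
            start = i
        last_data = i
    if start is not None:
        islands.append((start, last_data + 1))
    return islands
```

Comparing the island offsets and sizes against the cluster size of the file system can hint at whether whole blocks were swapped in, which would fit the chkdsk theory.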

What makes matters worse is that the PDF probably contained an embedded image (JPG probably) if it held a scan of a paper document. (PDF's can't store binary data natively, it is purely a vector format. Binary data is always embedded as a blob in another format. For images usually as a JPEG or sometimes PNG or TIFF.)
So even if you can somewhat repair the PDF headers you're still dealing with the corruption in the embedded image. Repairing that (with parts of the image missing/overwritten as well) is next to impossible as most image formats have internal compression and repairing a damaged compressed file is several orders of magnitude more complicated than an uncompressed file.

So I'm afraid this PDF is a loss.

However maybe not all is lost.

Go through that entire computer to see if another copy is lurking around (cache/temp folders). If there are backups, see if they go back to before the corruption happened.
Also look at each image file you can find. If the will was scanned there is a high chance it was saved in some image format somewhere before it was imported in the PDF file.
Also check word-processing documents (doc, docx, etc). It is possible the PDF was originally in Word or Wordpad and then "saved as PDF". The original file might still be around, maybe under a different name.
(In my own case I found 3 Word documents called Untitled.docx having relevant data. My uncle saved them, but never bothered to give them a proper name.)

Also think of online storage. If the PDF got backed up there (e.g. OneDrive), it could be that a previous, good version is still available through the version history there.
Also check (online) photo-storage on his phone. Possible that the "scan" was originally a photo of the paper document and that photo is still around.
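
Since a copy may be sitting somewhere under an unhelpful name or extension, the search suggested above can be done by content instead of by file name. The following Python sketch checks the first bytes of every file against a few well-known signatures; the function names and the choice of formats are assumptions for illustration, not a complete list.

```python
import os

# Leading bytes ("magic numbers") of formats a scanned document might have been saved in.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"II*\x00": "tiff",
    b"MM\x00*": "tiff",
    b"PK\x03\x04": "zip container (docx, odt, ...)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2 container (legacy doc)",
}


def sniff(header: bytes):
    """Return the format label whose signature starts the header, or None."""
    for magic, label in SIGNATURES.items():
        if header.startswith(magic):
            return label
    return None


def scan_tree(root: str):
    """Yield (path, label) for every readable file under root with a known signature."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    label = sniff(f.read(16))
            except OSError:
                continue  # unreadable file: skip it
            if label:
                yield path, label
```

Running `scan_tree` over the recovered data and sorting the hits by modification time narrows the pile down to files worth opening by hand, including ones like Untitled.docx or extensionless cache entries.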

Tonny

" (PDF's can't store binary data natively, it is purely a vector format. Binary data is always embedded as a blob in another format. For images usually as a JPEG or sometimes PNG or TIFF.)" I'm afraid this is incorrect. It is entirely possible for PDF files to contain binary data, and image data need not be encapsulated in some other format, PNG and TIFF are not supported in PDF. It is possible for the repeating binary 0x00 0xFF sequence to be valid PDF. PDF also includes XML and may include JavaScript actions, so nothing described here is totally beyond the bounds of belief in a PDF file. – KenS May 30 '23 at 11:18
@KenS This gets really technical, but pure PDF is just glorified postscript, which is text (and vectors described as text)-only. Binary data could only be included encoded in text-strings and you needed to add your own postscript macros to decode and do something useful with that data. Originally direct image handling was ONLY possible by embedding a TIFF image. Later Adobe added support to embed additional things (like XML, javascript, more image formats, Flash, even some video formats), but technically these are not part of the PDF spec itself. PDF just allows to embed these. – Tonny May 30 '23 at 11:28
I disagree. PDF is no longer PostScript, in fact it never was, it merely shared (mostly) the same graphics model. PostScript is a programming language, PDF is not. You cannot store a TIFF image (with TIFF formatting) in a PDF file in a way which a PDF interpreter can use. You can store it as an embedded file but that's not part of the PDF. You can however store binary data in a stream XObject (which is not a text string). You cannot and never could, use PostScript 'macros' in a PDF file. You could (for a time) use PostScript XObjects but that's not part of PDF in the same way JS isn't. – KenS May 30 '23 at 12:44
FWIW page 80 of the PDF 1.0 specification (the Pewter book) has an example of an image XObject, which is clearly not a TIFF. While that example encodes the data with ASCIIHex, that is not a requirement, it would be equally possible to specify a compression filter like LZW or simply use raw binary data. – KenS May 30 '23 at 12:48
With regards to damage to image: It really depends on extent of the damage and if actual data is largely present, but some times repair is possible. – Joep van Steen May 30 '23 at 13:12
Thank you so much @Tonny for your advice. We've recovered several terabytes of data from the computer and haven't found anything, but will try looking at Word or untitled files - good idea, thanks!! Unfortunately we aren't able to get into his telephone as it's fingerprint-protected. We were particularly anxious to see if he received any calls or text messages which could have provoked him to do what he did but there seems to be no way to get in there. – pennylane May 31 '23 at 07:09
@KJ you're absolutely right and as I explained in a separate comment the scan's probably not going to hold up in court, especially if it's contested (as it probably will be). As you suggested, it's more a matter of intellectually wanting to understand what happened, and also frustrating to know that these were someone's last words and wishes, and that we now won't ever know what they were. – pennylane Jun 01 '23 at 07:37
@KJ I'm really sorry to hear that, and what you've been going through. Pretty much in the same situation here as we also more or less know he wanted to leave everything to his 82 year old mother. Now his scheming ex-wife gets it all instead. Sigh... – pennylane Jun 01 '23 at 13:09
