Steganography in LaTeX

Question

I am a teacher. An ongoing problem is that students post assignments to websites like chegg.com to have other people write projects for them. I don't know if the students are providing JPGs or PDFs -- but what I see on the website is a text version of the assignment. See for example: https://www.chegg.com/homework-help/questions-and-answers/part-2-confidence-intervals-recovery-great-recession-2007-2009-economic-situation-many-fam-q37153607

My assignments are written in LaTeX. I'm wondering how I might embed a unique identifier in each student's assignment so that I can tell who posted it by looking at the posted text. Let's imagine that I'm teaching 500 students (9 bits), and it would be good to encode the semester and year (5 bits), plus a parity bit for good measure -- so let's just say I need to encode 16 bits.

Requirements:

Cannot use font changes -- the font information is lost in the chegg posting.
Cannot use watermarks or other imagery -- again, this information has been stripped from the posting.
Cannot use spacing changes -- mainly because this runs counter to how LaTeX formats the page, and I'm not sure that subtle spacing changes would survive to the posting (in other words, no stegsnow).
The optimal answer would embed the identifier multiple times so that even if only a portion of the assignment is posted, it is possible to identify the source.

What this means, I believe, is that I basically need something that is visible, but easily overlooked in the text itself. Ideas:

Double a particular word that that a reader is likely to overlook -- depending on which word is doubled, the identity is known (such as my doubling the word "that" in the preceding text).
Using extra punctuation marks that might be easily overlooked.. (see double periods at the end).
Inserting occasoinal letter flips that look like sloppy spell checking but actually encode the identifier. (see spelling of occasional)

Anyone know of something that has already been implemented? Any suggestions about the easiest way to do this?

Here's a potential starting point for a solution. In the optimal implementation, the \stenagbox could include "complicated" text such as begin/end{enumerate}, multiple paragraphs with formatting, etc.

\documentclass[10pt]{article}

\newcommand{\stenagbox}[2]{#2}

\begin{document}

  \stenagbox{16000}{When moving to a new area, it is important to
    understand the climate that you will be living in.  Does it rain
    more or less than you are used to?  Will it typically be hotter or
    colder than the city that you are coming from?  Just knowing where
    a city is located on a map is not sufficient.  In some areas,
    nearby mountains may block the wind and make the climate hotter or
    colder than expected.  In other areas, the ocean may keep the
    region cool in the summer time and warm in the winter.  Using
    statistics, the climate in two areas can be compared to determine
    what to expect.}

\end{document}

You seem to have some ideas in mind. Can you clarify what kind of help with LaTeX you are looking for? — Davislor, May 08 '19 at 00:22
@Davislor I would assume OP is wanting a way to TeX a file and get ideas 1-3 to occur at set points. But one question would be if OP is willing to post 500 different pdfs to the students. — Teepeemm, May 08 '19 at 00:36
@Davislor I have high-level knowledge of what might be done. I don't have the LaTeX skill to do it and don't know if there are any existing implementations that might work (or be easily modified to work). — Christopher Donham, May 08 '19 at 00:36
@Teepeemm If I did this, students would not get a PDF of the assignment. They would get a physical copy only. Students would hand back the physical copy with their name on it when they handed in the assignment so that I had the correspondence of ID with student. — Christopher Donham, May 08 '19 at 00:39
Well, a simple implementation might take a counter and decompose it into bits. Then you'd write the minor textual variants with \if and \else. Maybe \variant{"2}{it’s not}{it isn’t}, for each bit of the counter, where “It’s not” represents a zero in bit 2, and “it isn’t” represents a one. — Davislor, May 08 '19 at 01:03
Using an invisible watermark that is unique to each student would work great for PDFs. But this info will probably get lost if an image is used or OCR software is employed. For the particular example you link, my suggestion would be to randomize the numbers for each student so that they still need to do the arithmetic. The three suggestions you post are probably better coded outside of LaTeX, as is the management of tracking which version was assigned to which student. — Peter Grill, May 08 '19 at 01:03
@PeterGrill Keep in mind that I do not have to keep track of which version was assigned to which student. My idea is that I hand out paper copies that are unique, have the student return their assignment with their name on it. Then, if I find a posting that has the assignment, I have to match it to the assignment that was handed in (time consuming). I've been teaching for 6 years now and this year is the first time I have really had this problem so my hope is that in general, I don't have to worry about it. — Christopher Donham, May 08 '19 at 01:09
@PeterGrill As for solving it in LaTeX, I was thinking that I would somehow put the text where I wanted this encoded in some kind of box, and pass in the number I want encoded on the command line: \steganbox{This is the text to manipulate to encode the string}. A loop in a Makefile would then call LaTeX multiple times to make the various versions. I would concatenate the 500 PDFs into one large file and send it to the printer to make the prints. — Christopher Donham, May 08 '19 at 01:11
Yeah, that is a good idea about not tracking it, but it allows for the possibility that students work together and one student scans another student's assignment (once it is known that the assignments are unique) and posts it... Per your other comment, if you are willing to put a macro around each of the possible options, then those can be executed randomly or based on a \def specified for each run. In that case, this question becomes how you want to specify the various options (\RepeatWord, \RepeatPunctuation, \MispellWord, etc.), and it becomes pretty easy to solve. — Peter Grill, May 08 '19 at 01:15
@PeterGrill The ideas were meant to provide possible ways of doing it, not options that I need to be able to select in the file. I confess that I'm still not sure how to do the actual LaTeX implementation of this (e.g. how do I step through a box and find words and then transpose letters, or how do I step through a box and find the punctuation to duplicate, or how do I decompose the index I want to encode into bits that are then implemented in the text). I can see how to specify the problem such that it should be solvable -- but not how to actually do it. — Christopher Donham, May 08 '19 at 01:20
This conversation has developed to the point where it seems like a skeleton of a tex file (e.g. MWE) is warranted. I will add something to the post. — Christopher Donham, May 08 '19 at 01:24
You only want to ID who posted, since the returned answer is unlikely to carry an ID (other than being too perfect for that student). The coding does not need to be "tight" to be repetitive. If OCR is used, it is likely to highlight poor or duplicate wording (so avoid that), except when it sits in a code block. A strategic placement of, say, \bf in a location known to you but surplus to the core assignment could identify one semester, whilst a \small might identify a year, and a tabular number might be 7.601 without attracting too much attention. — , May 08 '19 at 01:29
For some odd reason, this question seems peripherally related: https://tex.stackexchange.com/questions/237520/security-printing-in-pdflatex-documents. It is about security protection to discourage unwanted photocopying. — Steven B. Segletes, May 08 '19 at 01:33
@StevenB.Segletes I don't think it is similar enough; that was based on plagiarising output. Unfortunately, in these times we need to see who is plagiarising input, since the output will be unique. — , May 08 '19 at 01:36
@KJO Yes, I understand the request is different from what I referenced, but it may be of interest to a prof providing hardcopy exams. — Steven B. Segletes, May 08 '19 at 01:37
@StevenB.Segletes I want kind of the opposite of your posting. I want the information to survive copying so that I can trace back the origin of a posting... (PS, thanks, though for the thought) — Christopher Donham, May 08 '19 at 01:37
Why not put a QR code on that is almost the same on every exam? If it gets posted then you can recover whatever text is encoded in the QR code. — JPi, May 08 '19 at 01:41
@JPi Because I am not sure if chegg is running the input file through some kind of OCR to generate the posting. I cannot say for sure, but some of the postings look like some kind of processing is happening -- on my assignment, for example, a logo was stripped from the posted version. If a QR code survived the process, then a unique watermark would as well. — Christopher Donham, May 08 '19 at 01:43
Just introduce a handful of acronyms, LaTeX vs. LATEX vs. LaTEX vs. lateX vs. LateX. (To be honest, IMHO this is the only way how I could relate this question to LaTeX and friends, and not to cryptography, say. ;-) — , May 08 '19 at 02:02
An idea to make the task of checking which version was uploaded less cumbersome is to add a unique identifier to the first page. The identifier should be encoded in a way that you can easily "rebuild" manually from the combination of steganographic variations that was used (a number, a graphical symbol, a dividing bar of varying length/width, ...). It doesn't matter if this information is stripped from the leaked answer; you would only need it for the physical copies. But if it is retained, all the better. It would also have the benefit of making non-unique variations easier to find... — Kess Vargavind, May 08 '19 at 03:49
A related (though different) question: https://tex.stackexchange.com/q/35133/5626, for people looking for another solution to a similar problem. — mbork, May 09 '19 at 05:12

Steven B. Segletes · Accepted Answer · 2019-05-08T18:27:16.277

I create a macro, \bitstream[<total tests>]{<test number>} (default: 256 total tests), that fills a token register with the binary digits of the test number. The MWE demonstrates how it works (the demonstration lines are not needed in your own test preparation).

Then, to encode your test versions, one uses \dobit{<output A>}{<output B>} to place slight differences into the output stream (i.e., the printed test). Each time it is invoked, it sucks the high-order bit from the \bits token register and uses it to decide output A versus B.

In the MWE, I created an 8-test matrix, requiring 3 bits (2^3=8), and so 3 \dobit choices are encountered to create 8 unique versions of the test. The versions have a comma included or not, spell "versions" correctly or not, and repeat the word "the" or not.

Whereas I just do a \bigskip to separate the test versions, presumably, one would use a \clearpage so that individual tests would appear on separate pieces of paper.

\documentclass{article}
\usepackage{pgffor}
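\newcounter{bitreg}% remaining value still to be converted
\newcounter{bitval}% weight of the bit currently being tested
\newtoks\bits% token register holding the bit string
% Append the token(s) #2 to the token register #1.
\newcommand{\addtotoks}[2]{#1\expandafter{\the#1#2}}
% \bitstream[<total tests>]{<test number>}: fill \bits with the bits
% of <test number>, most significant bit first.
\newcommand\bitstream[2][256]{%
  \setcounter{bitreg}{#2}%
  \setcounter{bitval}{\the\numexpr#1/2\relax}%
  \bits{}%
  \bitstreamaux%
}
% One step per bit: subtract the current weight; if the result is
% non-negative the bit is 1, otherwise it is 0 and the value is restored.
\newcommand\bitstreamaux{%
  \addtocounter{bitreg}{-\thebitval}%
  \ifnum\thebitreg>-1\relax\addtotoks\bits{1}\else
    \addtotoks\bits{0}\addtocounter{bitreg}{\thebitval}\fi
  \ifnum\thebitval=1\relax\else%
    \setcounter{bitval}{\the\numexpr\thebitval/2\relax}%
    \expandafter\bitstreamaux%
  \fi
}
% \dobit{<output A>}{<output B>}: remove the leading bit from \bits and
% print <output A> if it is 0, <output B> if it is 1.
\newcommand\dobit[2]{%
  \expandafter\checkbit\the\bits\relax
  \ifnum\thisbit=1\relax#2\else#1\fi
}
% Split the stream: first bit into \thisbit, remainder back into \bits.
\def\checkbit#1#2\relax{\gdef\thisbit{#1}\bits{#2}}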
\begin{document}
\bitstream{255} \the\bits

\bitstream{128} \the\bits

\bitstream{53} \the\bits

\bitstream[8]{5} \the\bits

\foreach\x in{0,...,7}{\bitstream[8]{\x}
Test \x:
This is a test\dobit{}{,} of % COMMA IN OR NOT
multiple vers\dobit{io}{oi}ns. The test %VERSIONS MISSPELLED OR NOT
is for all \dobit{the the}{the} marbles.\bigskip\par% THE REPEATED OR NOT.
}
\end{document}
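
Reading a posted copy back to a test number just reverses these choices. As a worked sketch for the 3-bit MWE (the symbols n and b_k are my own notation here, and this assumes the MSB-first stream with a power-of-two <total tests>): write b_k = 0 if the k-th \dobit site shows <output A> and b_k = 1 if it shows <output B>. A posting that reads "This is a test, of multiple versions. The test is for all the marbles." has the comma (B), the correct spelling (A), and a single "the" (B).

```latex
% b_k = 0 if the k-th \dobit site shows <output A>, 1 if it shows <output B>.
\[
  n = 4b_1 + 2b_2 + b_3,
  \qquad
  (b_1, b_2, b_3) = (1, 0, 1) \;\Rightarrow\; n = 4 + 0 + 1 = 5 .
\]
```

So that posting would trace back to Test 5, whose bit stream is 101.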

SUPPLEMENT

One additional note of interest. While the <total tests> are normally expected to be a power of 2, it seems to be the case that if they are not, unique tests will still be generated. However, the \bitstream will not correspond to the binary representation of the <test number>. For example, the following bit streams for 9 total tests,

\bitstream[9]{0}\the\bits\par
\bitstream[9]{1}\the\bits\par
\bitstream[9]{2}\the\bits\par
\bitstream[9]{3}\the\bits (4 not 3)\par
\bitstream[9]{4}\the\bits (5 not 4)\par
\bitstream[9]{5}\the\bits (8 not 5)\par
\bitstream[9]{6}\the\bits (9 not 6)\par
\bitstream[9]{7}\the\bits (10 not 7)\par
\bitstream[9]{8}\the\bits (12 not 8)\par

produces 9 unique results, just not the bitstreams corresponding to the numbers 0 through 8. The artifact arises from the integer arithmetic associated with the /2 division operation on numbers that are not powers of 2.
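
To spell that out a little: \numexpr rounds a quotient to the nearest integer instead of truncating it, so for 9 total tests the successive bit weights are 5, 3, 2, 1 rather than 8, 4, 2, 1. A worked sketch for test number 8 (the layout is mine, the numbers follow from the macros above):

```latex
% Successive values of the bitval counter for \bitstream[9]{...}
\[
  9/2 \to 5, \qquad 5/2 \to 3, \qquad 3/2 \to 2, \qquad 2/2 \to 1
\]
% Test number 8 decomposed over those weights, largest first
\[
  8 = 1\cdot 5 + 1\cdot 3 + 0\cdot 2 + 0\cdot 1
  \;\Rightarrow\; \mathtt{1100},
  \qquad
  \mbox{which read as plain binary is } 8 + 4 = 12 .
\]
```

This matches the "(12 not 8)" line in the listing above.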

DOUBLE SUPPLEMENT

One possible gotcha (user error) is if you fail to issue enough \dobit calls to match the number of bits allocated to your \bitstream. Then, the digits that differentiate the cases never make it into the test, and so some cases might not be differentiated.

A fix for that user error is to build the \bitstream starting with the LSB (least significant bit), rather than the MSB (most significant bit). That way, even an incomplete number of \dobit invocations would still provide differentiation.
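
As a small illustration of the difference, assume the default 256 total tests (8 bits) but only three \dobit sites in the text. The table below is a sketch I worked out by hand from the macros, not output copied from a run; brackets mark the three bits that actually get consumed.

```latex
% Bits in brackets are the three consumed by \dobit; the rest are never used.
\begin{tabular}{rll}
  test & MSB first           & LSB first           \\ \hline
  5    & \texttt{[000]00101} & \texttt{[101]00000} \\
  6    & \texttt{[000]00110} & \texttt{[011]00000}
\end{tabular}
```

With the MSB-first stream, tests 5 and 6 print identical text; with the LSB-first stream they differ at the first two sites.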

Here is a version of the answer that builds the \bitstream from LSB to MSB, rather than the opposite.

\documentclass{article}
\usepackage{pgffor}
\newcounter{bitreg}
\newcounter{bitval}
\newtoks\bits
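% Despite its name, \apptoks puts the single token #2 in FRONT of the
% contents of token register #1, so the last bit computed (the LSB) ends up first.
\newcommand{\apptoks}[2]{#1\expandafter{\expandafter#2\the#1}}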
\newcommand\bitstream[2][256]{%
  \setcounter{bitreg}{#2}%
  \setcounter{bitval}{\the\numexpr#1/2\relax}%
  \bits{}%
  \bitstreamaux%
}
\newcommand\bitstreamaux{%
  \addtocounter{bitreg}{-\thebitval}%
  \ifnum\thebitreg>-1\relax\apptoks\bits{1}\else
    \apptoks\bits{0}\addtocounter{bitreg}{\thebitval}\fi
  \ifnum\thebitval=1\relax\else%
    \setcounter{bitval}{\the\numexpr\thebitval/2\relax}%
    \expandafter\bitstreamaux%
  \fi
}
\newcommand\dobit[2]{%
  \expandafter\checkbit\the\bits\relax
  \ifnum\thisbit=1\relax#2\else#1\fi
}
\def\checkbit#1#2\relax{\gdef\thisbit{#1}\bits{#2}}
\begin{document}
\bitstream{255} \the\bits

\bitstream{128} \the\bits

\bitstream{53} \the\bits

\bitstream[8]{5} \the\bits

\foreach\x in{0,...,7}{\bitstream[8]{\x}
Test \x:
This is a test\dobit{}{,} of % COMMA IN OR NOT
multiple vers\dobit{io}{oi}ns. The test %VERSIONS MISSPELLED OR NOT
is for all \dobit{the the}{the} marbles.\bigskip\par% THE REPEATED OR NOT.
}
\end{document}
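
For the one-PDF-per-student workflow mentioned in the question's comments, here is a minimal sketch of how the number might be passed in on the command line. The macro name \studentid, the file name exam.tex, and the job name are all hypothetical choices of mine; the remaining definitions are the LSB-first ones from above. I use 65536 total tests to get the 16 bits asked for in the question.

```latex
% exam.tex -- a sketch; \studentid is a hypothetical name, not part of the answer.
% Compile one copy per student, for example:
%   pdflatex -jobname=exam-37 '\def\studentid{37}\input{exam}'
\documentclass{article}
\providecommand\studentid{0}% fallback when compiled without the \def
\newcounter{bitreg}
\newcounter{bitval}
\newtoks\bits
\newcommand{\apptoks}[2]{#1\expandafter{\expandafter#2\the#1}}
\newcommand\bitstream[2][256]{%
  \setcounter{bitreg}{#2}%
  \setcounter{bitval}{\the\numexpr#1/2\relax}%
  \bits{}%
  \bitstreamaux
}
\newcommand\bitstreamaux{%
  \addtocounter{bitreg}{-\thebitval}%
  \ifnum\thebitreg>-1\relax\apptoks\bits{1}\else
    \apptoks\bits{0}\addtocounter{bitreg}{\thebitval}\fi
  \ifnum\thebitval=1\relax\else
    \setcounter{bitval}{\the\numexpr\thebitval/2\relax}%
    \expandafter\bitstreamaux
  \fi
}
\newcommand\dobit[2]{%
  \expandafter\checkbit\the\bits\relax
  \ifnum\thisbit=1\relax#2\else#1\fi
}
\def\checkbit#1#2\relax{\gdef\thisbit{#1}\bits{#2}}
\begin{document}
\bitstream[65536]{\studentid}% 16 bits, least significant bit first
This is a test\dobit{}{,} of % COMMA IN OR NOT
multiple vers\dobit{io}{oi}ns. The test %VERSIONS MISSPELLED OR NOT
is for all \dobit{the the}{the} marbles.
\end{document}
```

A loop in a Makefile or shell script could run that command once per ID. With only the three \dobit sites shown, only the low three bits are visible, so a real assignment would need at least 16 sites (more, if the identifier should be repeated). One caveat: \checkbit assigns \bits locally, so bits consumed inside a group such as an enumerate environment are restored when the group ends; if your sites sit inside environments, changing that assignment to \global\bits{#2} is one possible workaround.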

+1 for using punctuation, which is good in principle because it is less noticeable when two students compare questions. The danger is that OCR may "correct" it and so point at the wrong student; it is less obvious, but also less robust in terms of accuracy. — , May 08 '19 at 02:38
@KJO I chose some punctuation examples, because it was mentioned by the OP. The \dobit function allows any subtle difference to be introduced, at the testmaker's discretion. — Steven B. Segletes, May 08 '19 at 02:40
Interesting. Instead of having the code search through for the spot to modify, this has me specify where each bit is encoded. I had not thought of doing it this way. Cool. This is the more complete of the two answers and I can see how to make it work for my situation. Thanks! — Christopher Donham, May 08 '19 at 02:48
@ChristopherDonham A perhaps more subtle way that would be easy to check is to modify the punctuation that comes after the question number. For example, Q1. Q2: Q3), etc. — erik, May 08 '19 at 03:25
