15

According to Security SE, it is highly recommended to obscure your email address in publicly available documents. I'm aware of the discussion about rendering email addresses as images: Crawler Resistance Email Address

Anyhow, I'm looking for a non-image solution to this problem: is it possible to use Unicode characters (which crawler software might not recognize) in combination with XeTeX?

MrD
  • 2,716
  • Untested: accsupp package from the oberdiek bundle, perhaps? My idea is to have two layers: one visible (PDF) and the other invisible (for crawlers). My next idea is to hide it as a layer using ocgx, but I am afraid that crawlers can still see all of that. – Malipivo Mar 31 '14 at 10:25
  • 6
    Did you read the Security SE answer pointing out that such tricks are a pain for legitimate users without actually achieving anything? – Joseph Wright Mar 31 '14 at 10:25
  • Ah, that is an interesting trick mentioned in that answer: convert an email address to outlines. But then a simple OCR scan can find it; I'm afraid that is the same problem as with images included in PDF files. – Malipivo Mar 31 '14 at 10:30
  • @JosephWright: They didn't talk about using Unicode, but rather about replacing @ by .at. and so on. Spam crawlers know this, but I doubt they know that = e and so on, as I think this is hardly ever used. – MrD Mar 31 '14 at 10:34
  • @Malipivo: I don't think crawlers run OCR on ALL the PDFs/images they find on the net. Very few images include email addresses, so I doubt it is worth the effort. – MrD Mar 31 '14 at 10:34
  • 3
    Not Unicode-related, but still my favorite obfuscation: john.doemy@pantsfoo.com – to e-mail me, remove my pants. (from [su]’s Does e-mail address obfuscation actually work?) – doncherry Mar 31 '14 at 10:35
  • @DL6ER You are right, but extracting images from PDF files and testing only the smaller (and wider) ones is not that difficult a task, generally speaking. – Malipivo Mar 31 '14 at 10:40
  • @doncherry: with my accessibility hat put on, this requires knowledge of the language and will give you a lot of bounces to john.doemy@foo.com, so it probably also requires knowledge about how last names in your culture are constructed. – Ulrich Schwarz Mar 31 '14 at 11:04
  • @UlrichSchwarz: I think this is too much knowledge for such an algorithm. Even if the sentence is known because it has been discussed in a very large community (which might of course also include spam crawler programmers), you can change the sentence and/or use other words to be removed. In addition, there are extremely complicated last names around that could hardly be recognized. – MrD Mar 31 '14 at 11:11
  • 4
    Do you really get enough spam to make it worth it? My email address has been in plain text on latex and other source files since before the UK joined the internet and I don't see that much spam (I get sent a lot, up to several thousand a day when there's a bad virus like sobig going around) but I just have a very aggressive spam filter (and allow my ISP and gmail to filter as well) – David Carlisle Mar 31 '14 at 11:34
  • @DavidCarlisle it's highly variable between ISPs, but it also depends on your tolerance for false positive accusations of spam - mine is ~0, so I have had to override the spam processing in my corporate gmail account or miss important emails from the same domain I'm on because there's no ability to tune it (even every recipient of an internal mailing list repeatedly marking not junk doesn't make a difference). – Chris H Mar 31 '14 at 12:07
  • @DavidCarlisle: I get almost no spam (less than one per day on average), primarily because I use my real email only rarely. However, I have to publish a lot of things in the coming years and I'm looking for a way to keep the spam level as low as it is now, because when you don't have to filter anything, you cannot miss an email that might be important. – MrD Mar 31 '14 at 14:40
  • @DL6ER: My point exactly. This is not about a spammer getting your address, this is about you getting legitimate feedback from people who do not speak your language. This looks like a very low barrier, until you set your operating system language to Polish and then try to get anything done. (Language chosen arbitrarily as one I don't speak.) – Ulrich Schwarz Mar 31 '14 at 15:33
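
The accsupp idea from the comments above could look roughly like the following. This is an untested sketch, not a verified recipe: it assumes the `ActualText` key of accsupp is honored by the PDF viewer or extraction tool, and the replacement string is purely illustrative. The glyphs of the real address still sit in the page content, so a crawler that ignores `ActualText` may well recover them.

```latex
% Sketch only: show the real address, but offer a decoy string to
% text extraction through the PDF ActualText attribute.
\documentclass{article}
\usepackage{accsupp} % from the oberdiek bundle

\begin{document}
Contact:
\BeginAccSupp{method=escape,ActualText={address available on request}}%
myname@example.com% what a human reader sees on the page
\EndAccSupp{}%
\end{document}
```

Note that this also affects legitimate users: copying the address from a viewer that respects `ActualText` returns the decoy, not the address.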

2 Answers

16

the randtext package may help; its ctan catalogue entry says

<description>
The package provides a single macro <tt>\randomize{TEXT}</tt> that
typesets the characters of <tt>TEXT</tt> in random order, such that
the resulting output appears correct, but most automated attempts
to read the file will misunderstand it.

This function allows one to include an email address in a TeX
document and publish it online without fear of email address
harvesters or spammers easily picking up the address.
</description>

of course, it’s not proof against ocr, but it provides another way to approach the problem.
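
A minimal usage sketch, assuming only the `\randomize` macro named in the catalogue entry above; the address is a placeholder, and characters that are special to TeX may need extra care (check the package documentation):

```latex
% Sketch only: typeset an address with randtext so that the characters
% are written to the output in shuffled order but still look correct.
\documentclass{article}
\usepackage{randtext}

\begin{document}
Contact: \randomize{address@example.de}
\end{document}
```

Because the shuffling is random, the order of the characters in the output file can differ from one compilation to the next.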

wasteofspace
  • 5,352
  • 1
    Copy-pasting the e-mail address with an ordinary PDF viewer gives me a dd r e s s @ e xa m ple . d e. This could be analyzed. I think they are intelligent enough to parse HTML and also PDF before they look for addresses. – MrD Mar 31 '14 at 11:08
10

You can do this: [@.] instead of myname@example.com but note this is real text so a non-spam user isn't prevented from copying the address and perhaps not noticing that it is unlikely to work. (If the bit between [ and ] isn't readable you may need to install some fonts to cover that range)
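
The mapping can also be done in TeX itself, without an external tool. What follows is a sketch for XeLaTeX under stated assumptions: the macro names `\obfuscate` and `\obf@char` and the font switch `\mathalphafont` are hypothetical, the font file `STIXTwoMath-Regular.otf` is assumed to be available (any font covering U+1D400 to U+1D7FF should do), and the bold math alphabet is chosen because its letter range has no gaps. Letters and digits are shifted into the Mathematical Alphanumeric Symbols block; all other characters, such as `@` and `.`, are left alone.

```latex
% Sketch only: shift ASCII letters and digits to the bold
% Mathematical Alphanumeric Symbols (compile with XeLaTeX).
\documentclass{article}
\usepackage{fontspec}
% Assumption: this font file is installed; replace as needed.
\newfontfamily\mathalphafont{STIXTwoMath-Regular.otf}

\makeatletter
% Hypothetical helper: typeset one character, shifted if it is a-z, A-Z or 0-9.
\newcommand\obf@char[1]{%
  \count@=`#1\relax
  % a-z (97..122) -> U+1D41A.., offset "1D41A - "61 = "1D3B9
  \ifnum\count@>96 \ifnum\count@<123 \advance\count@ "1D3B9 \fi\fi
  % A-Z (65..90)  -> U+1D400.., offset "1D400 - "41 = "1D3BF
  \ifnum\count@>64 \ifnum\count@<91 \advance\count@ "1D3BF \fi\fi
  % 0-9 (48..57)  -> U+1D7CE.., offset "1D7CE - "30 = "1D79E
  \ifnum\count@>47 \ifnum\count@<58 \advance\count@ "1D79E \fi\fi
  \char\count@}
% Hypothetical user macro: apply the shift to every character of #1.
\newcommand\obfuscate[1]{%
  {\mathalphafont
   \@tfor\@tempa:=#1\do{\expandafter\obf@char\@tempa}}}
\makeatother

\begin{document}
Contact: \obfuscate{myname@example.com}
\end{document}
```

One caveat: Unicode compatibility normalization (NFKC) maps these math letters back to plain ASCII, so a harvester that normalizes the extracted text would probably still recover the address.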

David Carlisle
  • 757,742
  • 2
    Generated with http://ewellic.org/mathtext.html which lists among its features: Confuse your enemies by obfuscating your text! Annoy the hell out of the Unicode Technical Committee! – David Carlisle Mar 31 '14 at 10:47
  • This program seems to be Windows(R) only, so I will have to try it when I get my hands on such a machine. – MrD Mar 31 '14 at 11:19
  • 2
    @DL6ER yes although it's a trivial transposition of the character codes up to the math alphabet range, so you could just write it in tex directly. I just always liked the release comment (especially as I was shown it by a member of the said Unicode committee:-) – David Carlisle Mar 31 '14 at 11:23
  • Because obviously, square square square square square is much better than real email addresses. (Translations: fonts are fonts and this solution is less than ideal) – Doorknob Mar 31 '14 at 21:04
  • @Doorknob if you see missing glyph markers in my answer it means that your system doesn't have math fonts which is in itself a bad thing that you should fix:-) (windows comes with cambria math, os X comes with stix, so I guess you are on linux, so should install stix or asana math or something) – David Carlisle Mar 31 '14 at 21:06
  • @DavidCarlisle On Windows 7, Chrome v33 shows square square... while IE11 shows the correct font. – darthbith Mar 31 '14 at 23:00
  • 2
    @darthbith firefox works of course as well, on chrome yes even v 36 (canary release) doesn't show most of the characters. Browser bugs is browser bugs. Of course they don't affect the actual question (which was assuming pdf rendering) – David Carlisle Mar 31 '14 at 23:34