How far should I go in creating anonymity?

Question

I am in the early design stages of a system (QuestGuide) to help researchers gather information (arbitrarily complex), analyze it, annotate it, visualize it, and, ultimately, share it. There are enough problems to solve that this feels like the software equivalent of the Seven Summits, but I'm retired and this project will keep me off the streets for a few years :-).

After watching a 2-part interview with Eben Moglen at Slashdot, I realized that I wasn't designing for / thinking about anonymity. Although my imagined target demographic is academic researchers (and maybe people shopping for a refrigerator), it is entirely possible that this system will be used by people who are gathering and correlating information that, if-and-when they share it, could get them imprisoned or worse.

For the purposes of being able to backtrack to the original source (e.g. for citations, source checking, etc.), I am collecting date-time, IP/URL of original sources, and all manner of stuff. This is being done as an aid for the researcher, but it could also be used as a source of information to work backward to the person who published their quest anonymously.

I have identified the following as needing removal / scrubbing / obfuscating:

UUIDs used need to be random and not related to MAC address and time.
All metadata with time-date, URL, IP, etc. must be removed.
The transmission of the quest must be encrypted with the receiver's public key, but not signed (or otherwise associated) with the sender.
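
The first two items in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names in `SENSITIVE_KEYS` and the `id` key are hypothetical, since the actual QuestGuide schema is not described here. The one firm fact is that `uuid.uuid4()` is generated from random bits, whereas `uuid.uuid1()` embeds the host's MAC address and a timestamp.

```python
import uuid

# Hypothetical field names; the real QuestGuide schema is not shown in the post.
SENSITIVE_KEYS = {"timestamp", "fetched_at", "source_url", "source_ip", "user_agent"}


def random_id() -> str:
    """Return a version-4 UUID, which is built from random bits only."""
    return str(uuid.uuid4())


def scrub(record, id_map=None):
    """Recursively drop sensitive keys and swap every 'id' for a random one.

    id_map remembers old -> new ids so that repeated ids stay consistent
    within one exported quest.
    """
    id_map = {} if id_map is None else id_map
    if isinstance(record, dict):
        cleaned = {
            key: scrub(value, id_map)
            for key, value in record.items()
            if key not in SENSITIVE_KEYS
        }
        if "id" in cleaned:
            cleaned["id"] = id_map.setdefault(cleaned["id"], random_id())
        return cleaned
    if isinstance(record, list):
        return [scrub(item, id_map) for item in record]
    return record
```

Calling `scrub(quest)` on an export just before it is shared returns a copy without the listed fields. It does nothing about identifying details inside the free text itself, which is the harder problem raised in the comments below.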

Question:

How many more types of information that might be in either the data or in the supporting metadata can you think of that I need to scrub when someone wants to be really anonymous?

If there's a list somewhere of things that have tripped people up in the past (anecdotal or real), that would be educational.

Update / Clarification:

QuestGuide is 100% FOSS. I'm currently trying to understand how GPLv3 Affero plays out in a system composed of many FOSS components: MariaDB, Django, and numerous other bits of FOSS released under various licenses. It may be necessary to fall back to GPLv2 Affero.

Although I'm sure there can be revenue generating operations based on it, I see them falling more along the lines of Red Hat selling support for a completely FOSS-based product. I'm retired and have no particular need or desire to be part of any of those operations.

The system itself is designed to run locally on the user's machine—DB, proxy server, UI, all of it. The "central site" is intended purely for the convenience of users who need multi-location access to their quests, who wish to share their results with the world, or who wish to download preconfigured components and entity definitions. Once QuestGuide is installed, the user does not need to have any further contact with any sort of centralized server.

It sounds like you're on top of the bits of data that directly identify the sender, but it's going to be difficult (or impossible) in many cases to have data sets be anonymous since the behavior being analyzed could be the very thing that makes it possible to de-anonymize the senders. Arvind Narayanan [http://33bits.org/] and other researchers have demonstrated that it's often possible to de-anonymize large proportions of any sufficiently interesting data sets. — pseudon, Oct 20 '12 at 13:39
@pseudon: Yeah, I know that that is an inherent risk, but there's really nothing I can do about it. I'm mostly looking for things I might miss that could unintentionally expose someone. — Peter Rowell, Oct 20 '12 at 15:54
QuestGuide sounds like an interesting project. Google-Refine is an interesting open source python project that may help: http://code.google.com/p/google-refine/ — rook, Oct 21 '12 at 18:11
@Rook: Thanks for the link. I had forgotten about Refine and just did a quick scan to refresh my memory. Its primary focus seems to be on data that is row/column oriented, and its language seems to be (at least at the moment) Java. I'm an old-school Text Retrieval / Computational Linguistics guy, so QuestGuide's focus (not too surprisingly) is on feature extraction, named-entity recognition, correlation analysis, template filling, etc. In some sense, QG and Refine may make a good combination, but I haven't given any thought to just how. – Peter Rowell, Oct 21 '12 at 21:15

score 2 · Answer 1 · answered Oct 20 '12 at 18:47

It sounds like anonymity is the opposite of what you want.

If you see me naked, is it still a violation if you don't know my name? Your system is a lot like this: you are collecting personal information and then obscuring who it belongs to, but at the end of the day you need to associate a large fact base with an individual, and that is what people fear.

I would make sure your system is at least as good as Google's stance on privacy. Google is pretty awful when it comes to privacy, but they are widespread and most of us have become accustomed to their targeted ads. If you are worse than Google then you might get the wrong kind of attention.

You should also respect the Do-Not-Track HTTP flag. It might not be against the law to ignore this flag right now, but I am sure it will be.
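
As a rough sketch of what respecting that flag could look like: browsers that enable Do Not Track send the request header `DNT: 1`. The helper name `tracking_allowed` and the plain-dict representation of the headers are assumptions for illustration, not part of any framework.

```python
def tracking_allowed(headers) -> bool:
    """Return False when the request carries the Do Not Track header `DNT: 1`."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): str(value).strip() for name, value in headers.items()}
    return normalised.get("dnt") != "1"
```

A server-side component would call this on the incoming request headers and skip any logging or profiling when it returns `False`.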

rook

I'm ... confused, and I'm not really sure I understand your answer. The system is intended to support research: for your physics thesis, buying a new fridge, doing a history project, whatever. One of many entity types supported is "person", so it's possible to collect and save information about, say, Gen. Andrew Jackson. Now if someone (freedom fighter?) uses QuestGuide as an intelligence-gathering system (and it has much in common with one), and then wants to share info about a crooked politician, I don't want to accidentally blow their cover. Seriously: I'm trying not to be evil. – Peter Rowell Oct 21 '12 at 00:33
@PeterRowell Rook has a rather unusual style of explaining things. You'll get used to it (or not!). The important part of what he's saying is that your system is inherently going to break the normal privacy standards that we expect from online services, just through the openness of the platform. There's not much you can do about this, other than say "the good that this service provides exceeds the evil that some individuals might use it for". – Polynomial Oct 21 '12 at 12:08
@Polynomial: In the interest of brevity (not one of my long suits) I failed to describe the FOSS nature of the system. Please see my update above. – Peter Rowell Oct 21 '12 at 15:06