How to anonymize new log records without breaking relations between old and new data?

I am generating log records about user actions. For privacy reasons, these need to be anonymized after N days. However, I also need to run reports against this anonymized data.

I want all actions by real user A to be listed under fake user X in the anonymized logs - records of one user must still remain records of one (fake) user in the logs. This obviously means that I need to have some mapping between real and fake users, which I use when anonymizing new records. Of course, this totally defeats the point of anonymization - if there's a mapping, the original user data can be restored.

Example:

User Frank Müller bought 3 cans of soup.

Three days later, User Frank Müller asked for refund for 3 cans of soup.

When I anonymize the second log entry, the first one has already been anonymized. I still want both log records to point to the same user. Well, that seems almost impossible to me in practice, so I would like to use some method of splitting up data that hopefully allows me to keep as much integrity as possible in the data. Perhaps using the logs as a data warehouse - split everything into facts and just accept the fact that some dimensions cannot be analyzed?

Have you encountered such a scenario before? What are my options here? I obviously need to make some sort of compromise - what has proven effective for you? How to get the most use out of such data?

At the risk of being pedantic, what you describe is not anonymous data, but rather pseudonymous data. That said, have you considered using some sort of keyed hash function such as HMAC-SHA1 to perform the pseudonym generation? You can reach a fair compromise with a scheme like this:

Separate your analysis and OLTP databases. Minimize the number of people that have access to both.
Keep the HMAC key private to the application that copies data to the analysis database, not accessible from either database. Perhaps have the application generate it on installation and obfuscate it using a hardcoded key, so that neither the system administrators nor the software developers will find it trivial to get at without collusion.
Do not copy real names and addresses, or any equivalent or easily linkable keys such as user numbers, invoice numbers, etc., from the OLTP database without hashing them.
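
As a rough illustration of the pseudonym generation, here is a minimal Python sketch. The names `PSEUDONYM_KEY` and `pseudonymize` are made up for this example, and the key is generated in place only to keep the snippet self-contained; in the scheme above it would live in storage private to the copying application.

```python
import hashlib
import hmac
import secrets

# Hypothetical key for illustration only. In the scheme described above it
# would be generated once at installation and kept out of both databases.
PSEUDONYM_KEY = secrets.token_bytes(32)


def pseudonymize(identifier: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Return a stable pseudonym for an identifier using HMAC-SHA1."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha1).hexdigest()


# Two log records written days apart still map to the same fake user,
# without storing any real-to-fake mapping table.
purchase = {"user": pseudonymize("Frank Müller"), "action": "bought 3 cans of soup"}
refund = {"user": pseudonymize("Frank Müller"), "action": "asked for refund for 3 cans of soup"}
```

Because the pseudonym is recomputed from the key each time, records anonymized on different days line up as long as the same key and the same spelling of the identifier are used.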

If you do this, there are two main routes of attack to obtain the real identity from the pseudonym.

Direct attack: Obtain the HMAC key, compute the pseudonym for each known user, and use the resulting table to look up the identity behind any pseudonym. (HMAC is irreversible: given only a pseudonym and the key, you cannot feasibly compute the original value directly, which is why the attacker has to enumerate candidate identities.)
Information fusion attack: Without knowledge of the key and list of identities, the next best thing is simply to attempt to correlate the pseudonymous data with other data -- perhaps even a stolen copy of the OLTP database.
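
To make the direct attack concrete, here is a small Python sketch of what an attacker holding the key could do. The key, the user list, and the function names are all hypothetical and chosen only for this example.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, key: bytes) -> str:
    """Same keyed hash the copying application is assumed to use (HMAC-SHA1)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha1).hexdigest()


def build_reverse_table(known_users, key):
    """Map pseudonym -> real identity for every identity the attacker knows."""
    return {pseudonymize(user, key): user for user in known_users}


# Hypothetical inputs: a leaked key and a list of known customer names.
leaked_key = b"hypothetical-leaked-key"
known_users = ["Frank Müller", "Anna Schmidt"]

table = build_reverse_table(known_users, leaked_key)

# A pseudonym seen in the analysis database can now be resolved by lookup.
pseudonym_in_logs = pseudonymize("Frank Müller", leaked_key)
recovered = table.get(pseudonym_in_logs)
```

The attack only resolves identities that appear in the attacker's list; a pseudonym for someone not in `known_users` simply has no entry in the table. This is why the key and the list of real identities both need to be kept away from the analysis side.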

Pseudonymous data sets are notoriously vulnerable to information fusion attacks -- you have to strip out or "blur" a lot of key correlating information to make the data set resistant to such attacks, but exactly how much you need to strip is a topic of current research.
