I am generating log records about user actions. For privacy reasons, these need to be anonymized after N days. However, I also need to run reports against this anonymized data.
I want all actions by real user A to be listed under fake user X in the anonymized logs - records of one user must still remain records of one (fake) user in the logs. This obviously means that I need to have some mapping between real and fake users, which I use when anonymizing new records. Of course, this totally defeats the point of anonymization - if there's a mapping, the original user data can be restored.
Example:
User Frank Müller bought 3 cans of soup.
Three days later, User Frank Müller asked for a refund for 3 cans of soup.
When I anonymize the second log entry, the first one has already been anonymized, yet I still want both log records to point to the same user. That seems almost impossible to me in practice, so I would like some method of splitting up the data that preserves as much of its integrity as possible. Perhaps I could treat the logs as a data warehouse: split everything into facts and accept that some dimensions cannot be analyzed?
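To make the problem concrete, here is a minimal Python sketch of the mapping-based approach I described, applied to the two log entries from the example. The record layout and the names `mapping`, `pseudonym_for` and `anonymize` are made up for illustration, and the in-memory dict stands in for whatever persistent store would really hold the mapping.

```python
import secrets

# Hypothetical mapping table: real user -> fake user.
# It has to outlive each anonymization run, which is exactly the problem.
mapping = {}

def pseudonym_for(real_user):
    """Return the fake user for a real user, creating one on first use."""
    if real_user not in mapping:
        mapping[real_user] = "user-" + secrets.token_hex(8)
    return mapping[real_user]

def anonymize(record):
    """Return a copy of the record with the real user replaced by the fake one."""
    return {**record, "user": pseudonym_for(record["user"])}

# Assumed record layout, for illustration only.
purchase = {"user": "Frank Müller", "action": "bought", "item": "soup", "quantity": 3}
refund = {"user": "Frank Müller", "action": "refund requested", "item": "soup", "quantity": 3}

anon_purchase = anonymize(purchase)  # anonymized first
anon_refund = anonymize(refund)      # anonymized three days later

# Both records still belong to the same fake user...
assert anon_purchase["user"] == anon_refund["user"]

# ...but anyone holding the mapping can turn the fake user back into the real one.
reverse = {fake: real for real, fake in mapping.items()}
assert reverse[anon_refund["user"]] == "Frank Müller"
```

The first assertion is the property I need for reporting; the second is the property that defeats the anonymization.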
Have you encountered such a scenario before? What are my options here? I obviously need to make some sort of compromise, so what has proven effective for you? How can I get the most use out of such data?