ansaurus

Question

How to separate a person's identity from his personal data?

Answer 1

+4 A:

I'm afraid that if your application can link a person to their data, any developer/admin can too.

The only thing you can do is make the link harder to establish, to slow the developer/admin down. But if you make it harder to link users to data, you make it harder for your server too.

Idea based on @no's idea:

You can have a classic user/password login to your application (hashed password, or whatever), and a special "pass" used to keep your data secure. This "pass" wouldn't be stored in your database.

When your client logs in to your application, they would have to provide user/password/pass. The user/password is checked against the database, and the pass is used to load/write data.

When you need to write data, you make a hash of the "username/pass" pair, and store it as a key linking your client to your data.

When you need to load data, you make a hash of the "username/pass" pair, and load all the data matching this hash.

This way it's impossible to make a link between your data and your user.

On the other hand (as I said in a comment to @no), beware of collisions. Plus, if your user types the wrong "pass", you can't check it.

Update: For the last part, I had another idea: you can store a hash of your "pass/password" pair in your database, so you can check whether the "pass" is okay.
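
A minimal sketch of the scheme described above, in JavaScript for Node.js. The helper names (`hashFields`, `dataKey`, `passCheck`, `writeRow`, `loadRows`), the in-memory `Map` standing in for the data table, and the choice of SHA-256 over a JSON-encoded array are all assumptions made for illustration; the answer itself does not prescribe any of them.

```javascript
const crypto = require('crypto');

// Hash a list of fields. JSON encoding keeps the field boundaries
// unambiguous (an assumption of this sketch, not part of the answer).
function hashFields(...fields) {
  return crypto.createHash('sha256').update(JSON.stringify(fields)).digest('hex');
}

// Key linking a client to their rows; derived from username + "pass"
// and recomputed on every request instead of being stored with the user.
function dataKey(username, pass) {
  return hashFields('data', username, pass);
}

// Value that could be stored with the user record so that a
// mistyped "pass" can be detected (the "Update" idea).
function passCheck(pass, password) {
  return hashFields('check', pass, password);
}

// Hypothetical in-memory stand-in for the data table: dataKey -> rows.
const dataTable = new Map();

function writeRow(username, pass, row) {
  const key = dataKey(username, pass);
  if (!dataTable.has(key)) dataTable.set(key, []);
  dataTable.get(key).push(row);
}

function loadRows(username, pass) {
  return dataTable.get(dataKey(username, pass)) || [];
}

writeRow('alice', 'my secret pass', { item: 'book', price: 12 });
console.log(loadRows('alice', 'my secret pass').length); // 1
console.log(loadRows('alice', 'wrong pass').length);     // 0
```

Keep in mind that a fast hash of a guessable "pass" can be brute-forced by anyone holding a copy of the table, so a slow key-derivation function such as scrypt would be the more cautious choice in practice.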

Colin Hebert 2010-09-17 17:26:35

Thank you for taking the time to answer, but the application can only link a person to their data if it knows their password (from which it can calculate the `user_hash`). Maybe I should have clarified that the `users` table has no `user_hash` column to which a person's data could be linked.

Rene Saarsoo 2010-09-17 17:36:49

If your application can compute the hash, why couldn't a developer rewrite the same hash method to obtain the same result? If you know how to access it within your application, you can always rewrite this code to access it from another application.

Colin Hebert 2010-09-17 17:40:17

Yeah, this approach won't work. The thing to do is save the password hash in the database as usual, but use a different hash for the sensitive stuff. See my response.

no 2010-09-17 17:55:45

I agree that I can't absolutely prevent somebody from detecting the link between a person and their data, but I assumed that this method would make detecting the link quite a bit harder. In my understanding, the user would need to be currently logged in to the application for me to be able to detect the link between him and his data. BTW, I didn't understand your thought about being able to rewrite a hash function...

Rene Saarsoo 2010-09-17 18:02:17

Updated my answer with an idea which could interest you.

Colin Hebert 2010-09-17 18:07:43

I don't really see why two separate passwords would be necessary. Aren't two passwords just the same as one longer password split in half?

Rene Saarsoo 2010-09-17 18:29:21

Two passwords are not necessary. Two hashes are. One hash can come from just the password and be stored in the database in the users table. The other comes from the concatenation of the username and password and is stored in the private info table. That way the user can 'get to' both tables, but the tables can't 'get to' each other.
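
To make the "two hashes" idea concrete, here is a rough sketch in JavaScript for Node.js. The table shapes, the function names (`register`, `ownerHash`, `addPurchase`, `purchasesOf`), the use of SHA-256 and the NUL delimiter are all assumptions of this sketch, not something the comment specifies.

```javascript
const crypto = require('crypto');

const sha256 = (text) => crypto.createHash('sha256').update(text).digest('hex');

// Hypothetical in-memory stand-ins for the two tables.
const users = [];       // rows: { username, passwordHash }
const privateInfo = []; // rows: { ownerHash, item, price }

// Hash 1: from the password alone, stored in the users table for login.
function register(username, password) {
  users.push({ username, passwordHash: sha256(password) });
}

// Hash 2: from username + password, stored only in the private info table.
// The NUL delimiter is an assumption; any character banned from names works.
function ownerHash(username, password) {
  return sha256(username + '\u0000' + password);
}

function addPurchase(username, password, item, price) {
  privateInfo.push({ ownerHash: ownerHash(username, password), item, price });
}

function purchasesOf(username, password) {
  const key = ownerHash(username, password);
  return privateInfo.filter((row) => row.ownerHash === key);
}

register('bob', 'hunter2');
addPurchase('bob', 'hunter2', 'coffee', 3);
// The user can reach both tables, but no stored value appears in both.
console.log(purchasesOf('bob', 'hunter2').length);               // 1
console.log(users[0].passwordHash === privateInfo[0].ownerHash); // false
```

Since usernames sit in the users table in the clear, the second hash is only as strong as the password: someone with a copy of both tables could try likely passwords for each username. A real system would also salt and stretch the login hash.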

no 2010-09-17 18:31:30

@Rene, I guess you're right as long as your data for the second pass isn't stored at all in your database.

Colin Hebert 2010-09-17 18:34:52

Answer 2

A:

The problem is that if someone already has full access to the database then it's just a matter of time before they link up the records to particular people. Somewhere in your database (or in the application itself) you will have to make the relation between the user and the items. If someone has full access, then they will have access to that mechanism.

There is absolutely no way of preventing this.

The reality is that by having full access we are in a position of trust. This means that the company managers have to trust that even though you can see the data, you will not act in any way on it. This is where little things like ethics come into play.

Now, that said, a lot of companies separate the development and production staff. The purpose is to remove Development from having direct contact with live (ie:real) data. This has a number of advantages with security and data reliability being at the top of the heap.

The only real drawback is that some developers believe they can't troubleshoot a problem without production access. However, this is simply not true.

Production staff would then be the only ones with access to the live servers. They will typically be vetted to a greater degree (criminal history and other background checks), commensurate with the type of data you have to protect.

The point of all this is that this is a personnel problem; and not one that can truly be solved with technical means.

UPDATE

Others here seem to be missing a very important and vital piece of the puzzle. Namely, that the data is being entered into the system for a reason. That reason is almost universally so that it can be shared. In the case of an expense report, that data is entered so that accounting can know who to pay back.

Which means that the system, at some level, will have to match users and items without the data entry person (ie: a salesperson) being logged in.

And because that data has to be tied together without all parties involved standing there to type in a security code to "release" the data, then a DBA will absolutely be able to review the query logs to figure out who is who. And very easily I might add regardless of how many hash marks you want to throw into it. Triple DES won't save you either.

At the end of the day all you've done is make development harder with absolutely zero security benefit. I can't emphasize this enough: the only way to hide data from a dba would be for either 1. that data to only be accessible by the very person who entered it or 2. for it to not exist in the first place.

Regarding option 1, if the only person who can ever access it is the person who entered it.. well, there is no point for it to be in a corporate database.

Chris Lively 2010-09-17 17:32:55

That's what I've thought also... but it's a small startup with only two developers and not much more.

Rene Saarsoo 2010-09-17 17:42:22

@Chris - DB access is not the same as full access. This info can be hidden from DB admins, but someone with root or physical access to the web server could probably still get it. The Q is about protecting the data from those with database access; I think it's completely feasible. Please see my response, I hope it might change your mind.

no 2010-09-17 18:01:48

I don't think that I can't troubleshoot without access to production systems, but I do think I can do it significantly faster. Problems that I could locate in minutes could require hours or days trading emails with the DBAs.

mikerobi 2010-09-17 18:26:41

@no: Sorry, but it can't be hidden from them. All a dba has to do is run profiler to see the queries coming across. From there it would be trivial to match people to hash keys in the OP's example. Easier still if the queries coming across performed joins on the table.

Chris Lively 2010-09-17 19:01:10

@mikerobi: I've seen a lot of devs say the same thing. However, once separated it actually makes life better because it forces you to have an honest testing environment. Further, if absolutely necessary, a DBA could move the production data to the testing area and simply obfuscate the identifying information. However, I've never seen that actually needed.

Chris Lively 2010-09-17 19:22:07

@mikerobi: And by decent testing environment, I mean one that from a hardware and software perspective (including patch levels) is 100% identical to production.

Chris Lively 2010-09-17 19:23:05

Actually, in the current scenario the only person who should be allowed to access the data really is the person who entered it. But you bring up a valid point, although I disagree that storing your personal data on some centralized server is pointless.

Rene Saarsoo 2010-09-17 19:35:49

@mikerobi: I just remembered one other thing that helped out, at one place there was a sysadmin who would deliver stack traces from the memory dumps when a problem occurred. Those always pointed us to the exact area the problem was in and were much faster than ever trying to trace through ourselves.

Chris Lively 2010-09-17 19:43:03

The expense report situation you're referring to, and the idea that the system needs to know who bought what, AFAICT, is not something the OP said was a restriction. I quote: "Given enough users, it should be almost impossible to tell how much money a particular user has spent by just knowing his name."

no 2010-09-18 03:32:17

If this is going to happen, as you and I know, it's got to be impossible for the system to match them up also. I assume that is his intention, you assume otherwise. That's where our opinions differ, I think.

no 2010-09-18 03:32:57

@no: Yes, there wasn't quite enough info to really detail the solution so we both picked opposite assumptions. ;)

Chris Lively 2010-09-21 19:09:58

Answer 3

A:

Actually, there's a way you could possibly do what you're talking about...

You could have the user type his name and password into a form that runs a purely client-side script which generates a hash based on the name and pw. That hash is used as a unique id for the user, and is sent to the server. This way the server only knows the user by hash, not by name.

For this to work, though, the hash would have to be different from the normal password hash, and the user would be required to enter their name / password an additional time before the server would have any 'memory' of what that person bought.

The server could remember what the person bought for the duration of their session and then 'forget', because the database would contain no link between the user accounts and the sensitive info.

edit

In response to those who say hashing on the client is a security risk: It's not if you do it right. It should be assumed that a hash algorithm is known or knowable. To say otherwise amounts to "security through obscurity." Hashing doesn't involve any private keys, and dynamic hashes could be used to prevent tampering.

For example, you take a hash generator like this:

http://baagoe.com/en/RandomMusings/javascript/Mash.js

// From http://baagoe.com/en/RandomMusings/javascript/
// Johannes Baagoe <[email protected]>, 2010
function Mash() {
  var n = 0xefc8249d;

  var mash = function(data) {
    data = data.toString();
    for (var i = 0; i < data.length; i++) {
      n += data.charCodeAt(i);
      var h = 0.02519603282416938 * n;
      n = h >>> 0;
      h -= n;
      h *= n;
      n = h >>> 0;
      h -= n;
      n += h * 0x100000000; // 2^32
    }
    return (n >>> 0) * 2.3283064365386963e-10; // 2^-32
  };

  mash.version = 'Mash 0.9';
  return mash;
}

See how n changes, each time you hash a string you get something different.

1. Hash the username+password using a normal hash algo. This will be the same as the key of the 'secret' table in the database, but will match nothing else in the database.
2. Append the hashed pass to the username and hash it with the above algorithm.
3. Base-16 encode var n and append it to the original hash with a delimiter character.

This will create a unique hash (it will be different each time) which can be checked by the system against each column in the database. The system can be set up to allow a particular unique hash only once (say, once a year), preventing MITM attacks, and none of the user's information is passed across the wire. Unless I'm missing something, there is nothing insecure about this.

no 2010-09-17 17:34:42

no 2010-09-17 17:50:54

The big problem I see with this is when the user changes their password, the hash (and therefore the only link between the account and the data) changes as well. It would be best to use an identifier that will remain static. Perhaps if you had an additional database to map username/password hashes to user IDs, then that might be different.
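
One way the mapping idea could look, sketched in JavaScript for Node.js. Everything named here (`credentialKey`, `idMap`, `enroll`, `changePassword`, the random 16-byte ID) is hypothetical and only meant to show that a password change then touches a single mapping row rather than every data row.

```javascript
const crypto = require('crypto');

const sha256 = (text) => crypto.createHash('sha256').update(text).digest('hex');

// Hash of username + password; the NUL delimiter is an assumption.
const credentialKey = (username, password) => sha256(username + '\u0000' + password);

// Hypothetical mapping table: credential hash -> static data ID.
const idMap = new Map();
// Hypothetical data table, keyed by the static ID only.
const dataRows = []; // rows: { dataId, item, price }

function enroll(username, password) {
  const dataId = crypto.randomBytes(16).toString('hex'); // not derived from user info
  idMap.set(credentialKey(username, password), dataId);
  return dataId;
}

function dataIdFor(username, password) {
  return idMap.get(credentialKey(username, password));
}

// Only the mapping row is re-keyed; the data rows keep their static ID.
function changePassword(username, oldPassword, newPassword) {
  const oldKey = credentialKey(username, oldPassword);
  const dataId = idMap.get(oldKey);
  if (dataId === undefined) return false; // wrong old password
  idMap.delete(oldKey);
  idMap.set(credentialKey(username, newPassword), dataId);
  return true;
}
```

The user still has to supply the old password at change time, because nothing else in the system can locate the mapping row.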

bta 2010-09-17 17:59:12

Hmm, this approach could work, but beware of hash collisions. In this case they could get really ugly.

Colin Hebert 2010-09-17 18:00:57

bta: that's a good point about password changes. The PW change thing might have to require the additional sensitive login first so the app can know the user's sensitive hash and change it accordingly when they change password.

no 2010-09-17 18:11:36

Colin: Hash collisions could potentially get ugly, but a good algo like sha256, used properly, should support plenty of users without collisions.

no 2010-09-17 18:13:27

@no: What if two users have the same password?

SamB 2010-09-17 18:22:01

@SamB - the hash would be generated from the concatenation of the username and password with some delimiter character that's not allowed in usernames or passwords ... so they'd have to have the exact same username and password ... not an issue if the system requires unique usernames.
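
A small illustration of why the delimiter matters, in JavaScript for Node.js. The NUL character as delimiter and the function names `naive` and `delimited` are assumptions of this sketch.

```javascript
const crypto = require('crypto');

const sha256 = (text) => crypto.createHash('sha256').update(text).digest('hex');

// Assumed delimiter: a character that is rejected in usernames and passwords.
const DELIMITER = '\u0000';

// Plain concatenation: different (username, password) pairs can
// produce the same input string, and therefore the same hash.
const naive = (username, password) => sha256(username + password);

// With a delimiter, the boundary between the two fields is part of the input.
const delimited = (username, password) => sha256(username + DELIMITER + password);

console.log(naive('ab', 'c') === naive('a', 'bc'));         // true
console.log(delimited('ab', 'c') === delimited('a', 'bc')); // false
```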

no 2010-09-17 18:26:17

@SamB - if the hash was just based off the password, it would already be in the database, and defeat the whole point of this exercise ;)

no 2010-09-17 18:27:24

A client-side script is a really bad way to calculate something this critical/important.

Nelson 2010-09-17 18:42:12

I'm thinking about your suggestion to calculate the hash on the client side. It seems like a minor security advantage over calculating the hash on the server, the only difference being that the server never sees the plain-text password at all, but with a significant downside: the access mechanism is heavily dependent on JavaScript (one wouldn't be able to access the site with lynx).

Rene Saarsoo 2010-09-17 18:46:05

There doesn't appear to be a reason that the hashing needs to occur on the client side. With the mechanism you present, anyone with access to that web application could correlate requests and identify users trivially. It is just as well to perform this hashing server-side if you are only looking to keep the DB admin from correlating user data. I do see an issue with this mechanism for low volume sites. If the database admin is watching queries and there is not enough noise, it would be pretty easy to see a user record queried and then a set of purchase records. Especially when changing pw

Mike S 2010-09-17 18:46:06

@no: well, I was assuming that the password-checking hashes were salted :-).

SamB 2010-09-17 18:56:35

@no: this solves nothing. Someone needs a report on employee spending. Therefore, the system has to match an employee to their spending and spit out a list of names with those items. DBA watches queries come across and sees that "bob is as234bas@" and "sally is cvbx87324" All the hashing in the world won't protect you from a blackhat dba.

Chris Lively 2010-09-17 19:04:36

@Rene - If you're talking about JavaScript this is a web app. You have at least the client side, server-side code, plus database. See my answer and keep the server-side code + database on separate servers.

Nelson 2010-09-17 19:07:20

The reason I suggested doing the hashing on the client was so it could work on non-https pages without passing the password in plain text. If everything is secured, it won't matter.

no 2010-09-18 01:05:48

To those who say it solves nothing and whoever downvoted: it does make things more secure. Maybe not to someone who can watch stuff come across the wire, but certainly to someone who just gets a picture of the data at some point in time (like, say, a copy). This makes things more secure just from the POV that a straight copy of your database, should it get leaked, won't give anything sensitive away, where before it would have.

no 2010-09-18 01:08:11

@Chris Lively: and the more I think about it I'm not sure someone could really determine who's who just by watching the database activity... how would you know which db requests belong to which sessions? And if you can find that out, the 'secret' stuff could just be done in another session... then how can the DBA get this info, short of debugging the webapp as it runs or looking at ips in the server log (not the DBA's job, shouldn't have that ability)?

no 2010-09-18 01:14:35

@Chris Lively: also with this system the "employee spending" report situation you suggest is not feasible; the report would have _no way to know_ which users bought what, because the database doesn't keep track of what user info goes with what spending info, which was the whole point. It's the user's sensitive info, and only they have access to it on an individual level. Reports can still use it for aggregate statistics, etc.

no 2010-09-18 01:23:17

Answer 4

+1 A:

The only way to ensure that the data can't be connected to the person it belongs to is to not record the identity information in the first place (make everything anonymous). Doing this, however, would most likely make your app pointless. You can make this more difficult to do, but you can't make it impossible.

Storing user data and identifying information in separate databases (and possibly on separate servers) and linking the two with an ID number is probably the closest thing that you can do. This way, you have isolated the two data sets as much as possible. You still must retain that ID number as a link between them; otherwise, you would be unable to retrieve a user's data.

In addition, I wouldn't recommend using a hashed password as a unique identifier. When a user changes their password, you would then have to go through and update all of your databases to replace the old hashed password IDs with the new ones. It is usually much easier to use a unique ID that is not based on any of the user's information (to help ensure that it will stay static).

This ends up being a social problem, not a technological problem. The best solution will be a social one. After hardening your systems to guard against unauthorized access (hackers, etc), you will probably get better mileage working on establishing trust with your users and implementing a system of policies and procedures regarding data security. Include specific penalties for employees who misuse customer information. Since a single breach of customer trust is enough to ruin your reputation and drive all of your users away, the temptation of misusing this data by those with "top-level" access is less than you might think (since the collapse of the company usually outweighs any gain).

bta 2010-09-17 17:54:14

The separate databases idea looks interesting. BTW, I didn't really mean to put the actual hash into a table in this exact way, but rather to use an intermediate table that maps hashes to user IDs. But I simplified my original question a lot, and this got simplified away.

Rene Saarsoo 2010-09-17 19:50:03

Answer 5

+2 A:

Create a users table with:
1. user_id: an identity column (auto-generated id)
2. username
3. password: make sure it's hashed!
Create a product table like in your example:
1. user_hash
2. item
3. price

The user_hash will be based off of user_id which never changes. Username and password are free to change as needed. When the user logs in, you compare username/password to get the user_id. You can send the user_hash back to the client for the duration of the session, or an encrypted/indirect version of the hash (could be a session ID, where the server stores the user_hash in the session).

Now you need a way to hash the user_id into user_hash and keep it protected.

1. If you do it client-side as @no suggested, the client needs to have user_id. Big security hole (especially if it's a web app): the hash can easily be tampered with and the algorithm is freely available to the public.
2. You could have it as a function in the database. Bad idea, since the database has all the pieces to link the records.
3. For web sites or client/server apps you could have it in your server-side code. Much better, but then one developer has access to the hashing algorithm and data.
4. Have another developer write the hashing algorithm (which you don't have access to) and stick it on another server (which you also don't have access to) as a TCP/web service. Your server-side code would then pass the user ID and get a hash back. You wouldn't have the algorithm, but you can send all the user IDs through to get all their hashes back. Not a lot of benefit over #3, though the service could have logging and such to try to minimize the risk.

If it's simply a client-database app, you only have choices #1 and #2. I would strongly suggest adding another [business] layer that is server-side, separate from the database server.

Edit: This overlaps some of the previous points. Have 3 servers:

Authentication server: Employee A has access. Maintains user table. Has web service (with encrypted communications) that takes user/password combination. Hashes password, looks up user_id in table, generates user_hash. This way you can't simply send all user_ids and get back the hashes. You have to have the password which isn't stored anywhere and is only available during authentication process.
Main database server: Employee B has access. Only stores user_hash. No userid, no passwords. You can link the data using the user_hash, but the actual user info is somewhere else.
Website server: Employee B has access. Gets login info, passes it to the authentication server, gets the hash back, then disposes of the login info. Keeps the hash in the session for writing/querying to the database.

So Employee A has user_id, username, password and algorithm. Employee B has user_hash and data. Unless employee B modifies the website to store the raw user/password, he has no way of linking to the real users.

Using SQL profiling, Employee A would get user_id, username and password hash (since user_hash is generated later in code). Employee B would get user_hash and data.
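
A rough sketch of what the authentication server's single entry point might look like, in JavaScript for Node.js. The HMAC with a server-side secret, the scrypt password hashing, and all names (`SERVER_SECRET`, `createUser`, `authenticate`) are assumptions of this sketch; the answer only says that the service hashes the password, looks up user_id and generates user_hash.

```javascript
const crypto = require('crypto');

// Secret known only to the authentication server (hypothetical;
// generated at startup here, a real service would load a persistent key).
const SERVER_SECRET = crypto.randomBytes(32);

// Users table on the authentication server: username -> record.
const users = new Map();
let nextUserId = 1;

function hashPassword(password, salt) {
  return crypto.scryptSync(password, salt, 32); // 32-byte Buffer
}

function createUser(username, password) {
  const salt = crypto.randomBytes(16);
  users.set(username, {
    userId: nextUserId++,
    salt,
    passwordHash: hashPassword(password, salt),
  });
}

// The only operation offered to the website server: it takes the
// username and password, never a bare user_id, so the full list of
// user_ids cannot simply be pushed through to collect every hash.
function authenticate(username, password) {
  const user = users.get(username);
  if (!user) return null;
  const candidate = hashPassword(password, user.salt);
  if (!crypto.timingSafeEqual(candidate, user.passwordHash)) return null;
  // user_hash depends on user_id, which never changes, so a
  // password change does not orphan the user's data.
  return crypto
    .createHmac('sha256', SERVER_SECRET)
    .update(String(user.userId))
    .digest('hex');
}
```

The website server would keep the returned hash in the session and use it as the `user_hash` key when reading or writing the main database.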

Nelson 2010-09-17 19:03:04

Also, if you separate the two tables into two different database servers, you now need access to 3 things: Users table, product table, and server-side/web service hashing algorithm. Chances are if they can get into one database they have access to the other, but it is still less risky.

Nelson 2010-09-17 19:08:56

*If you don't have any need to connect data from different sessions*, then you could use a different, random user_hash every time they log in. You would only have to store the hash for the duration of the session. After that you wouldn't have any way of knowing which user_id went to which user_hash. You could still link the data written in that session for reporting or whatever you need.
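
A minimal sketch of the per-session variant, in JavaScript for Node.js. The in-memory `sessions` map and the function names are hypothetical, and the username/password check is assumed to have happened before `login()` is called.

```javascript
const crypto = require('crypto');

// Session store kept in memory only: sessionId -> userHash.
const sessions = new Map();

function login() {
  // Assumes the username/password check has already succeeded.
  const sessionId = crypto.randomBytes(16).toString('hex');
  const userHash = crypto.randomBytes(32).toString('hex'); // fresh on every login
  sessions.set(sessionId, userHash);
  return sessionId;
}

function userHashFor(sessionId) {
  return sessions.get(sessionId); // undefined once the session is gone
}

function logout(sessionId) {
  // After this, nothing records which user the hash belonged to.
  sessions.delete(sessionId);
}
```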

Nelson 2010-09-17 19:16:29

SQL Profiler defeats all of this as the queries themselves will give it away.

Chris Lively 2010-09-17 19:24:55

@Chris - Not within one query. You could see `username='foo' and password='bar'` and then the next query `select * from products where user_hash=blah`. You can assume the two are related. If you separate database servers, you won't see both queries unless you have access to both.

Nelson 2010-09-17 19:26:43

@Nelson: It doesn't have to be one query. It just has to come from experience watching the trends and knowing the application behavior. ie: after successful login your app will generally run the same follow up queries. Because users log in and out of apps throughout the day it's relatively easy to profile what those steps are and therefore easy to figure out A => B
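
To illustrate the kind of correlation being described, here is a toy sketch in JavaScript. The log format (`type`, `username`, `userHash`, `time` in milliseconds), the one-second window and the function name `correlate` are invented for the example; a real profiler trace would look different.

```javascript
// Pair each login lookup with the first data query that follows it
// within a short time window, and count how often each pairing occurs.
function correlate(log, windowMs = 1000) {
  const guesses = new Map(); // username -> Map(userHash -> count)
  let lastLogin = null;
  for (const entry of log) {
    if (entry.type === 'login') {
      lastLogin = entry;
    } else if (
      entry.type === 'data' &&
      lastLogin !== null &&
      entry.time - lastLogin.time <= windowMs
    ) {
      const counts = guesses.get(lastLogin.username) || new Map();
      counts.set(entry.userHash, (counts.get(entry.userHash) || 0) + 1);
      guesses.set(lastLogin.username, counts);
      lastLogin = null; // use each login only once
    }
  }
  return guesses;
}

// Hypothetical trace, ordered by time.
const log = [
  { type: 'login', username: 'bob', time: 0 },
  { type: 'data', userHash: 'as234bas', time: 40 },
  { type: 'data', userHash: 'zzz999', time: 60 },
  { type: 'login', username: 'sally', time: 9000 },
  { type: 'data', userHash: 'cvbx87324', time: 9030 },
];

console.log(correlate(log).get('bob').get('as234bas'));    // 1
console.log(correlate(log).get('sally').get('cvbx87324')); // 1
```

On a busy site the pairings would be noisy, but repeated logins by the same user would make the correct hash stand out in the counts.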

Chris Lively 2010-09-17 19:35:03

@Nelson: I'd agree that separate databases with their own separate DBAs would stop them (no linked queries allowed!), but the other half of the OP's question was about a developer with access to the web server. Even with 2 db servers, the data has to be collated back for reports. A developer or other sys admin could simply take a memory dump whenever they want to catch the data.

Chris Lively 2010-09-17 19:41:38

@Chris: See my edits which are clearer on separation of DB servers. Unless you want the reports to say which username is linked to each piece of data, there is no reason to join/collate. You can generate "anonymous user" reports.

Nelson 2010-09-17 19:45:34

@Chris: Sure, if you take a memory dump after the user_hash is received and the user/pass haven't been cleared you could join the two. In that case you can send the user/pass, clear it out, then wait for the response to make it a little harder. I agree it's not impossible to link the two if you start analyzing memory or change the source code on the website.

Nelson 2010-09-17 19:49:28

@Nelson: Let's say I'm the accountant. I need a list of all people who submitted an expense report so that I can write them a check. Or I'm a supervisor and need to see my employees and what they've done...

Chris Lively 2010-09-17 19:52:50

@Nelson, all I'm saying is that we can jump through all these hoops to stop particular groups of people, but there is ALWAYS a point at which you have to trust someone with the keys to everything. Which is why I say it's a personnel problem, not a technical one.

Chris Lively 2010-09-17 19:54:33

@Chris: Security usually isn't 100%. You just have to make it hard enough so the effort:gain ratio isn't worth it. Current hash algorithms may take 100 years to go through brute-force. If you have enough money and can dedicate a million computers to cracking one hash, you can probably do it, but it's not worth it. When you give someone server access you are often decreasing the effort side of the equation orders of magnitude. It's a risk every company must take. That said, there is a place for proper security so it does take 100 years to crack something. :)

Nelson 2010-09-17 19:57:16

@Chris: Depends on the nature of the data. With anonymous surveys I can't imagine having to link back to the user. And yes, if the supervisor has access to both servers, or says, "Developer B, store all the future login/hash info" it can be done. It's definitely a personnel problem in that sense, though there are mitigating steps one can take. I think we both agree, though my answer didn't say "it can always be undone".

Nelson 2010-09-17 20:01:15

@Nelson: You're right in that we are very close in agreement ;). The only difference appears to be in what the system is built for. If the data does not need to be collated, then your approach absolutely solves the problem. However, if it does need to collate for reports, then there will always be someone with access to everything. BTW, +1.

Chris Lively 2010-09-17 20:14:57

@Chris: I wouldn't say it "absolutely solves it" in this case, but at least much better than nothing. Anyway, this was probably the hardest +1 I have received! :)

Nelson 2010-09-17 20:50:33

hashing on the client isn't necessarily insecure. It should be assumed that all algorithms are known. I'll update my post.

no 2010-09-18 03:42:41

@no - Right, but by hashing on the client you either base it on the entered user/pass (which could change in the future) or you have to send the userid back, which is unnecessarily exposing data and adding a roundtrip--not in the least efficient.

Nelson 2010-09-21 20:35:23

Answer 6

+1 A:

Keep in mind that even without actually storing the person's identifying information anywhere, merely associating enough information all with the same key could allow you to figure out the identity of the person associated with certain information. For a simple example, you could call up the strip club and ask which customer drove a Ferrari.

For this reason, when you de-identify medical records (for use in research and such), you have to remove birthdays for people over 89 years old (because people that old are rare enough that a specific birthdate could point to a single person) and remove any geographic coding that specifies an area containing fewer than 20,000 people. (See http://privacy.med.miami.edu/glossary/xd_deidentified_health_info.htm)
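
The two rules just mentioned can be sketched as a filter, here in JavaScript. This is a simplified illustration only: the record fields (`age`, `birthDate`, `areaCode`), the population lookup and the function name `deidentify` are hypothetical, and actual de-identification standards cover many more identifiers.

```javascript
// Returns a copy of the record with the risky fields removed.
// areaPopulation maps an area code to the number of people living there.
function deidentify(record, areaPopulation) {
  const out = { ...record };
  // Rule 1: drop the birth date for people over 89 years old.
  if (out.age > 89) {
    delete out.birthDate;
  }
  // Rule 2: drop geographic codes for areas with fewer than 20,000 people.
  // Unknown areas are treated as small, to stay on the safe side.
  const population = areaPopulation[out.areaCode];
  if (population === undefined || population < 20000) {
    delete out.areaCode;
  }
  return out;
}

const populations = { A: 5000, B: 250000 }; // made-up figures
const cleaned = deidentify(
  { age: 93, birthDate: '1917-03-02', areaCode: 'A', diagnosis: 'flu' },
  populations
);
console.log(cleaned); // { age: 93, diagnosis: 'flu' }
```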

AOL found out the hard way when they released search data that people can be identified just by knowing what searches are associated with an anonymous person. (See http://www.fi.muni.cz/kd/events/cikhaj-2007-jan/slides/kumpost.pdf)

Gabe 2010-09-17 20:04:43

Answer 7

A:

It seems like you're right on track with this, but you're just overthinking it (or I simply don't understand it).

Write a function that builds a new string based on the input (which will be their username or something else that can't change over time).

Use the returned string as a salt when building the user hash (again, I would use the userID or username as an input for the hash builder because they won't change like the user's password or email).

Associate all user actions with the user hash.

No one with only database access can determine what the hell the user hashes mean. Even an attempt at brute forcing it by trying different seed, salt combinations will end up useless because the salt is determined as a variant of the username.
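
Roughly what this could look like, sketched in JavaScript for Node.js. The salt construction, the use of scrypt and the names `buildSalt` and `userHash` are assumptions; the answer does not name specific functions.

```javascript
const crypto = require('crypto');

// Hypothetical salt builder: a deterministic variant of the immutable
// input (user ID or username), so it never has to be stored.
function buildSalt(stableId) {
  return crypto.createHash('sha256').update('salt:' + String(stableId)).digest();
}

// User hash built from the same immutable input plus the derived salt.
function userHash(stableId) {
  return crypto.scryptSync(String(stableId), buildSalt(stableId), 32).toString('hex');
}

// Every user action would be stored under userHash(id) instead of id.
const action = { userHash: userHash(42), item: 'book', price: 12 };
```

As pointed out elsewhere in this thread, this only protects against someone who sees the database and not the code: anyone who can read these two functions can recompute every hash from the list of usernames.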

I think you've answered your own question with your initial post.

John 2010-09-17 20:18:39

I think the assumption is that the user's name and personal info need to be stored somewhere in the database too, and the question is how to keep that info and the 'secret' info separate.

no 2010-09-18 01:19:37
