views:

138

answers:

3

While clensing PII from test data I have been stuck with a challenging scenario: cascading the changes through the foreign key relationships in the data. Given the focus on privacy and regulations should this practice be discouraged? If the PII itself were not used in any key fashion a neat trick would be to just shuffle the columns.

There are some commercial tools available to address this problem but none of them seem to handle a large variety of databases well.

+4  A: 

Sounds dangerous and stupid and inefficient. Keys should be synthetic ids.

tpdi
I'd be nervous doing business with a company that treated my PII so carelessly.
Michael
But it doesn't seem to be one of known 'laws of database design'.
ojblass
+1  A: 

HIPAA has a concept called the "Unique Patient Identifier" which can be used as we describe to link data: http://www.ncvhs.hhs.gov/app4.htm

Unique Patient Identifier eliminates the need for the repetitive use and disclosure of an individual's personal identification information (i.e. name, age, sex, race, marital status, place of residence, etc.) for routine internal and external communications (e.g. orders, results, medication, consultation, etc.) and protects the privacy of the individual. It helps preserve the patient anonymity while facilitating communication and information sharing. Healthcare is fundamentally a multi-disciplinary process. A Unique Patient Identifier enables the integration and the availability of critically needed information from multi-disciplinary sources and multiple care settings. Therefore, the integrity and security of the patient information depend on the use of a reliable Unique Patient Identifier.

The privacy issue hinges not so much on the identifier itself, but on the security and privacy of the data that the identifier is used to access, and how that access is controlled. My understanding is that typically this means that a system querying for information via a patient identifier should only get back information that can not be pieced together to reveal private information.

Essentially you would generate an artificial key for each person. Even though it is unique to the person, it is not personally identifying, unless you also were to release personally identifiable information along with it. For example, if you let people see only first names with a particular query, but also returned the artificial key, then they now know that artificial key 00003 is associated with first name Bob. now if you allow them to somehow go back and query with 00003 as criteria, and allow them access to the lastname, you can see how they can start to accumulate information. It is important that there be no way for an unauthorized user to get the artifical key and PII returned in the same query, since that would then make the artifical key itself PII. that's my interpretation at least.

AaronLS
A: 

Besides the HIPPA issues, another problem with using PII as a key is that it changes. People get new SSNs when they have their identities stolen. SSNs are also often miskeyed and thus relate the information for the wrong person (thinking more of data imports from other systems here). People (especially female people) often change their names. Differnt people also have the same name (and often, for this reason, databases hold incorrect SSN infomation for them as well becasue they match to the wrong SSN for that name) and thus very little PPI is in fact unique enough to be a key field. Further, PII should be stored in an encrypted field making it an even worse choice for a key field.

HLGEM