tags:

views:

559

answers:

6

What are current practices for enabling developers to build systems that contain private data? Can anyone point to a "best practices" guide for that sort of thing?

We have a Catch-22 here in that developers need to write applications that go against systems that have data that is considered "private." The IT administration would like for us developers to not have access to the data (ie. provide a schema or data structure, but not data itself) whereas most developers (myself included) would like to have access to the production data since not having a representative dataset can lead to bad assumptions (eg. the format of data) and bugs later on.

Does anyone have any formalized "best practices" for this type of thing? Especially official guildines from some "BigCo" (eg. Microsoft, IBM) might help since it is needed to convince management.

+3  A: 

Often times, a subset of sanitized data will be provided that is representative of the private data, but not the private data itself.

scubabbl
That's exactly what we do here, and I can honestly say we're a very large company (i.e. >100k employees)
Abyss Knight
+1  A: 

At MediumCo, we strip proprietary data out of our production data in Test and Dev. It has hurt us a little in the past to not have exactly-representative data, but the clients have asked about this point before, and it's usually not an issue, as the environments are populated with a lot of fake proprietary data.

Tom Ritter
+1  A: 

At my company, we started using Red-gate's data generator to generate test data. There is a bit of setup, but you can use the tools to generate very usable test data. Yes, I would prefer to use live production data, but it's not feasible (especially if you need to consider in HIPAA). It uses regex for each column and allows you to use look-up table's for related tables.

Brian Childress
A: 

I don't have any best practices paper or anything. But I would think that if you're developing out of an environment that is as protected as the environment that hosts the data in production, there wouldn't be a lot of argument to be made against it.

That is, if your production database is in a datacenter hosted and controlled and secured by your IT staff, if you have a development database that lives in the exact same scenario and doesn't offer any new ways to access the information - you would be in pretty good shape. As an added token of good will - it might be nice to offer to allow anyone worried about security a chance to do some kind of penetration test to ensure that you're telling the truth about security.

The other side of this, of course, is the analysis of the cost for not using the data: that is, it will lead to buggier code, which will cost $xxxxxx.xx in development time vs. virtually no cost to allow a small subset of your development team access to said data.

If Test and Dev were _as protected_ as Production, every Developer would either have access to Production, or limited access in Development. Not being able to run a Select query or login in as a different user would severely hamper my development.
Tom Ritter
A: 

To avoid the need to manually sanitise/anonymise data, you could use random text replacement - to replace every alphanumeric character in each text field with a random alphanumeric. This:

  • keeps the data similar in length, size etc. from the developer's point of view
  • does not cause problems with character sets
  • leaves date and number fields untouched, which allows for accurate testing with respect to date ranges and quantities
  • will satisfy most privacy requirements

If you wanted to go a little further you could run random number-for-number replacement on telephone numbers and zip codes, while using alphanumeric replacement on other text fields.

Having an automated replacement script allows you to get up-to-date data dumps from the live system regularly, so your tests are up-to-date with respect to the size and variability of the data in practice.

It does mean that a small number of operations will not be realistic (e.g. indexing on name fields, which in real life are clustered around common letters) but these should be limited.

Leigh Caldwell
+3  A: 

My view of the world may be different, as I'm based in the UK, but for the past 20-odd years, I've worked primarily in the public sector on systems handling sensitive data. The rules are **completely** cut-and-dried. No production data is allowed on the development estate.

As a fundamental principle, we do not want to be responsible for the loss of sensitive data. The users are perfectly good at that, themselves.

Within the past 12 months, my wife has moved from the same regime to one in the private sector where they allow developers access to production data and she's horrified by it. The legal implications (in the UK, at least) can be severe.

Developers don't **need** access to production data. It's simply laziness. Define and create test data to exercise defined test cases (including edge cases) and don't rely on the random-esque nature of production data.

If you **must** use production data (i.e. you manage to convince someone who doesn't know any better that it's acceptable), ensure the data is anonymised **before** it reaches the development estate.

Steve Morgan