views:

28

answers:

2

I think this is a long-shot, but here it goes:

The basic question is: how does a development team begin repairing data integrity on a large, damaged dataset?

The company I'm helping out has a huge MySQL/PHP5 system with a few years of cruft, invalid data, broken references, etc. To top it all off, this data references data on a few online services, such as Google AdWords.

So the local db has problems, and the relationships between the local and the remote data (e.g. AdWords) also have problems, compounding the issue.

Does anyone have tips, tricks, or best practices they can share for beginning to repair the data integrity? And for maintaining data integrity in a system that is rapidly and continuously being added to and updated?

A: 

Depending on the requirements and how much "damage" exists, it might be prudent to create a new database and modify the application to update both in parallel.

Data which are valid could be imported into the new d/b, and then a progressive series of extractions could clean up and import further data, until the effort increases to the point where it no longer makes sense to try to recover seriously damaged data. Surely an undamaged but incomplete database is better and more useful than a corrupt one: as long as it's corrupt, it cannot be called "complete".
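
Each extraction pass can be a plain query that copies only rows passing whatever validity checks you trust so far. A rough sketch of that idea, where all database, table and column names are made up for illustration:

    -- Hypothetical extraction pass: copy only rows that pass basic validity
    -- checks from the legacy schema into the new one.
    INSERT INTO clean_db.customers (id, name, email, adwords_account_id)
    SELECT o.id, o.name, o.email, o.adwords_account_id
    FROM legacy_db.customers AS o
    WHERE o.email IS NOT NULL
      AND o.email LIKE '%@%'
      AND (o.adwords_account_id IS NULL
           OR o.adwords_account_id IN
              (SELECT a.account_id FROM legacy_db.adwords_accounts AS a));

Later passes would loosen or repair the conditions (fixing what can be fixed) rather than simply skipping the rows.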

wallyk
A: 

The big problem is identifying what you intend to do about the problem data:

  • nothing
  • reconstruct from data held elsewhere and accessible via code
  • reconstruct the data manually
  • delete it (or preferably archive it)

And in order to do that, you need to establish how the problem data affects the system/organization, and how the resolution will affect it.

This is your first level of classification. Once you've got this, you need to start identifying specific issues and, from those, derive a set of semantic rules defining errant patterns.
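
Each such rule can usually be expressed directly as a query that lists the violating rows. A minimal sketch, assuming a hypothetical orders table whose customer_id should reference customers:

    -- Hypothetical rule "ORPHAN_ORDER": orders pointing at a customer
    -- that no longer exists.
    SELECT o.id
    FROM orders AS o
    LEFT JOIN customers AS c ON c.id = o.customer_id
    WHERE o.customer_id IS NOT NULL
      AND c.id IS NULL;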

This should then allow you to define the fixes required, prioritize the work effectively and plan your resource utilization. It should also allow you to prioritize, plan and partially identify root-cause removal.

I'm not sure what your definition of 'huge' is - but I would infer that it means that there are lots of programmers contributing to it - in which case you certainly need to establish standards and procedures for managing the data integrity going forward, just as you should do with performance and security.

The rules you have defined are a starting point for ongoing data management, but you should think about how you are going to apply them going forward. Adding a timestamp field to every table, and maintaining tables referencing rows which violate specific rules, means that you won't need to process all the data every time you want to check it - just the stuff which has changed since the last time you checked. It's a good idea to keep track of the cases being removed from the violation list as well as the ones being added.
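
One possible shape for this, again with made-up names: a violation-tracking table keyed by rule and row, plus an updated_at column on each checked table so a re-run only scans recently changed rows.

    -- Hypothetical violation-tracking table.
    CREATE TABLE integrity_violations (
        rule_code   VARCHAR(50)  NOT NULL,   -- which semantic rule was broken
        table_name  VARCHAR(64)  NOT NULL,   -- table containing the offending row
        row_id      INT UNSIGNED NOT NULL,   -- primary key of that row
        detected_at DATETIME     NOT NULL,
        resolved_at DATETIME     NULL,       -- set when the case drops off the list
        PRIMARY KEY (rule_code, table_name, row_id)
    );

    -- Incremental re-check of the orphan-order rule: only rows touched
    -- since the last run, assuming each table carries an updated_at column.
    SELECT o.id
    FROM orders AS o
    LEFT JOIN customers AS c ON c.id = o.customer_id
    WHERE o.updated_at >= @last_check_time
      AND o.customer_id IS NOT NULL
      AND c.id IS NULL;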

Do keep records of fixes applied and corresponding rule violations - and analyse the data to identify hotspots where re-factoring may result in more maintainable code.
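
Assuming something like the violation-tracking table sketched above, the hotspot analysis can be a simple aggregation:

    -- Open violations by rule and table; the biggest counts point at
    -- the likeliest refactoring candidates.
    SELECT rule_code, table_name, COUNT(*) AS open_violations
    FROM integrity_violations
    WHERE resolved_at IS NULL
    GROUP BY rule_code, table_name
    ORDER BY open_violations DESC;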

symcbean