The big problem is deciding what you intend to do about the problem data:
- nothing
- reconstruct from data held elsewhere and accessible via code
- reconstruct the data manually
- delete it (or preferably archive it)
And in order to do that, you need to establish how the problem data affects the system/organization, and how each of those resolutions would affect it.
This is your first level of classification. Once you've got this, you need to start identifying specific issues and, from those, derive a set of semantic rules that define the errant patterns.
This should then allow you to define the fixes required, prioritize the work effectively and plan your resource utilization. It should also help you plan and prioritize root-cause removal, and at least partially identify where those root causes lie.
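To make that concrete, here's a minimal sketch (in Python) of how such a rule set could be captured - the table names, the example rule and the Remediation categories are purely illustrative, not taken from your schema:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Remediation(Enum):
    NOTHING = auto()             # accept the data as-is
    RECONSTRUCT_FROM_CODE = auto()   # rebuild from data held elsewhere
    RECONSTRUCT_MANUALLY = auto()
    ARCHIVE = auto()             # archive (or delete) the offending rows

@dataclass
class Rule:
    name: str
    description: str
    violation_query: str         # SELECT returning primary keys of violating rows
    remediation: Remediation
    priority: int                # 1 = fix first

RULES = [
    Rule(
        name="order_without_customer",
        description="Orders referencing a customer id that no longer exists",
        violation_query="""
            SELECT o.id FROM orders o
            LEFT JOIN customers c ON c.id = o.customer_id
            WHERE c.id IS NULL
        """,
        remediation=Remediation.ARCHIVE,
        priority=1,
    ),
]
```

Tagging each rule with its remediation and priority up front is what lets you plan the fix work rather than discovering it piecemeal.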
I'm not sure what your definition of 'huge' is, but I'd infer it means there are lots of programmers contributing to it - in which case you certainly need to establish standards and procedures for managing data integrity going forward, just as you would for performance and security.
The rules you have defined are a starting point for ongoing data management, but you should think about how you are going to apply them going forward. Adding a timestamp field to every table, and maintaining tables that reference rows violating specific rules, means you won't need to process all the data every time you want to check it - just what has changed since the last check. It's also a good idea to keep track of the cases being removed from the violation list as well as the ones being added.
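As a rough sketch of that incremental pass, assuming each audited table has an `id` primary key and a `last_modified` column, and that violations are tracked in a `rule_violations` table (rule_name, row_id, detected_at, resolved_at) - all assumptions about your schema, not prescriptions:

```python
import sqlite3
from datetime import datetime, timezone

def incremental_check(conn: sqlite3.Connection, rule_name: str, table_name: str,
                      violation_query: str, last_run: datetime) -> None:
    """Re-check one rule against rows changed since the previous run,
    recording new violations and closing ones that no longer apply."""
    now = datetime.now(timezone.utc).isoformat()
    since = last_run.isoformat()

    # Rows touched since the previous run - the only ones worth re-checking.
    changed = {r[0] for r in conn.execute(
        f"SELECT id FROM {table_name} WHERE last_modified > ?", (since,))}

    # Changed rows that currently violate the rule (violation_query is
    # expected to filter on last_modified > ? itself).
    violating = {r[0] for r in conn.execute(violation_query, (since,))}

    # Violations already on record and still open for this rule.
    open_violations = {r[0] for r in conn.execute(
        "SELECT row_id FROM rule_violations "
        "WHERE rule_name = ? AND resolved_at IS NULL", (rule_name,))}

    # New violations among the changed rows.
    for row_id in violating - open_violations:
        conn.execute(
            "INSERT INTO rule_violations (rule_name, row_id, detected_at) "
            "VALUES (?, ?, ?)", (rule_name, row_id, now))

    # Changed rows whose earlier violation has now cleared: keep the record,
    # just mark it resolved so the trend over time stays visible.
    for row_id in (open_violations & changed) - violating:
        conn.execute(
            "UPDATE rule_violations SET resolved_at = ? "
            "WHERE rule_name = ? AND row_id = ?", (now, rule_name, row_id))

    conn.commit()
```

The point of the `changed` set is that the cost of a run is proportional to churn since the last check, not to the total volume of data.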
Do keep records of the fixes applied and the corresponding rule violations, and analyse that data to identify hotspots where refactoring may result in more maintainable code.
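For the hotspot analysis, something as simple as counting violations per rule over time (again assuming the same hypothetical `rule_violations` table) is often enough to show where the refactoring effort will pay off:

```python
import sqlite3
from collections import Counter

def violation_hotspots(conn: sqlite3.Connection, since_iso: str) -> list[tuple[str, int]]:
    """Count violations detected since a given timestamp, grouped by rule,
    so the noisiest rules (and the code paths feeding them) stand out."""
    counts = Counter(
        rule for (rule,) in conn.execute(
            "SELECT rule_name FROM rule_violations WHERE detected_at > ?",
            (since_iso,)))
    return counts.most_common()
```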