Hi all,

The question title is probably not quite right, because part of my question is to try to get a better understanding of the problem.

I am looking for arguments for why data imported into a database (a simple example: an Excel table into an Access database) should be supplied using the same schema and should also be valid against the business requirements.

I have an Excel table containing non-normalised data and an Access database with normalised tables.

The Excel tables come from multiple third parties, none of whom stick to the same format as each other or as the database.

Some of the sources also do not supply all the relevant data.

Example of what could be supplied:

contact_key, date, contact_title, reject_name, reject_cost, count_of_unique_contact

- count_of_unique_contact is derived from distinct contact_title values and should not be imported.
- contact_key is sometimes not supplied.
- contact_title is sometimes unknown and passed in as "n/a", "name = ??1342", "#N/A", etc., fairly randomly.
- reject_name is often misspelled.
- Whole fields are sometimes missing entirely, e.g. date and contact_key.
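To make the problem concrete, here is a minimal sketch in Python/pandas of how an import step might flag these placeholder values and drop the derived column. The file name and the placeholder patterns are assumptions based on the examples above, not the real feeds:

    import re
    import pandas as pd

    # Hypothetical file name; the real feeds vary by third party.
    df = pd.read_excel("third_party_supply.xlsx")

    # Placeholder patterns guessed from the examples above.
    PLACEHOLDER = re.compile(r"^\s*(n/?a|#n/a|unknown|\?+)\s*$", re.IGNORECASE)

    def looks_like_placeholder(value):
        """True for values that are clearly not real titles."""
        if pd.isna(value):
            return True
        text = str(value)
        return bool(PLACEHOLDER.match(text)) or "??" in text

    # The derived column should never be imported.
    df = df.drop(columns=["count_of_unique_contact"], errors="ignore")

    # Flag suspect titles instead of silently importing them.
    df["title_is_placeholder"] = df["contact_title"].apply(looks_like_placeholder)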

I am trying to find information to help explain the issues with the above: specifically, issues where incorrect or missing data and fields make it difficult to get useful data into the database, such as not being able to report a monthly trend in reject costs when the date is not supplied. Normalising the Excel file is not an option available to me.

What I want is to require that the values and fields in the Excel files match the business requirements, and that the format is the same for every third party that sends them, but the request is falling on deaf ears.

I want to explain to the client that inputting fake data and constantly checking for invalid or duplicate rejects/contacts is wrong, and that doing it is going to fail, or at best be difficult, without constant maintenance of a poor system.

Does anyone have any information on this problem?

Thanks

+3  A: 

This is a common problem; in data-processing circles it is referred to as "garbage in, garbage out". Essentially, what you're running up against is that the data as given is of poor quality; you're right to recognize that it will be hard (if not impossible) to extract any useful information from it.

To some extent, this is a problem that should be fixed at the source: whoever supplies your data needs to be convinced that the data quality must improve. In the short term, you can sanitize your data; the term refers to removing or cleaning the bad entries so that the remainder (the "good" data) can be imported into your database. Depending on what percentage of your data is bad, you may or may not be able to do useful things with the sanitized data once you import it.
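As a rough illustration of what "sanitizing" might look like in practice, here's a minimal sketch in Python/pandas that splits a feed into importable and rejected rows. The column names come from the question; the validation rules and file name are invented for illustration:

    import pandas as pd

    df = pd.read_excel("third_party_supply.xlsx")  # hypothetical file name

    def split_good_bad(df):
        """Separate importable rows from rows failing basic validation.

        Illustrative rules only: a row is rejected if it lacks a
        contact_key, lacks a date, or has a non-numeric reject_cost.
        """
        cost = pd.to_numeric(df["reject_cost"], errors="coerce")
        bad = df["contact_key"].isna() | df["date"].isna() | cost.isna()
        return df[~bad], df[bad]

    good, rejected = split_good_bad(df)
    # Only 'good' goes on to the Access import; 'rejected' is kept for reporting.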

At some point, since you're not getting traction with management about the quality of the data, you will simply have to show them that the system is not working as intended because the data quality is bad. They'll then need to improve their processes so that the quality of the data you receive improves. Until then, keep pressing for better data; investigate the process of sanitizing the data and see what you can do with what remains. Good luck!

McWafflestix
It's a common and long-standing practice in IT to produce what we used to call an "Edit Report", stating which data failed validation. Be sure to make it clear to management exactly which data is bad, and then be certain not to use the bad data. If you can also save a file containing just the bad data, and perhaps attach it to an email to management, that might help.
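For instance (a minimal Python/pandas sketch; the file name and the two checks are invented, not your actual rules):

    import pandas as pd

    df = pd.read_excel("third_party_supply.xlsx")  # hypothetical file name

    # Record one reason string per failing row; the checks are illustrative.
    failures = []
    for idx, row in df.iterrows():
        problems = []
        if pd.isna(row.get("contact_key")):
            problems.append("missing contact_key")
        if pd.isna(row.get("date")):
            problems.append("missing date")
        if problems:
            # +2 roughly maps back to the Excel row (1-based, plus header).
            failures.append({"excel_row": idx + 2, "problems": "; ".join(problems)})

    pd.DataFrame(failures).to_csv("edit_report.csv", index=False)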
John Saunders
Very good point; the Edit Report is a particularly good idea to communicate to management just how bad the data actually is...
McWafflestix
Thanks for the replies; nice use of keywords to help me along. I think my problem also suffers from the fact that the data sources are random with their invalid values. I have actually gone as far as adding tables for invalid values, where each item in the table has a relationship to the valid table, and then writing a macro so the invalid values are replaced with correct ones. This is a manual process and becomes difficult when there is no consistency in what comes in incorrectly each month.
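Roughly what my macro does, sketched in Python/pandas (the misspellings here are made-up examples; my real correction tables live in Access):

    import pandas as pd

    # Mirrors the invalid-value tables described above: each known-bad
    # spelling maps to the valid value. The entries are invented examples.
    corrections = {
        "scrach": "scratch",
        "scrtch": "scratch",
        "dnt": "dent",
    }

    df = pd.read_excel("third_party_supply.xlsx")  # hypothetical file name
    # Known misspellings are fixed; anything not in the table passes through
    # untouched, which is why inconsistent new errors each month hurt.
    df["reject_name"] = df["reject_name"].replace(corrections)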
Pricey
Another problem I face is that the client expects me to sanitise this data by checking that duplicate contacts are not inserted, yet they expect this to work when some third parties send no contact_key or name. I am dealing with 10 columns of data and an average of 15,000 rows per file, so it quickly mounts up to a lot of work even if only 10% is incorrect.
Pricey
@Pricey: I'd recommend asking the client how they want you to check duplicate contacts, and keep drilling down to specifics, until they've pretty much laid out the entire algorithm for you; once you've got that, either it'll be easy to implement, or it'll be so complicated that they'll understand why it's important to get the data in good form.
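For example, one spec they might eventually land on, sketched in Python/pandas (the matching rules here are hypothetical; the point is that they have to write the rules down somewhere):

    import pandas as pd

    df = pd.read_excel("third_party_supply.xlsx")  # hypothetical file name

    # Hypothetical rule: contacts with a contact_key are duplicates when the
    # key repeats; keyless contacts fall back to a case-insensitive title match.
    keyed = df[df["contact_key"].notna()].drop_duplicates(subset=["contact_key"])

    unkeyed = df[df["contact_key"].isna()].copy()
    unkeyed["_norm_title"] = unkeyed["contact_title"].astype(str).str.strip().str.lower()
    unkeyed = unkeyed.drop_duplicates(subset=["_norm_title"]).drop(columns=["_norm_title"])

    deduped = pd.concat([keyed, unkeyed], ignore_index=True)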
McWafflestix
@McWafflestix: Thanks again
Pricey