views:

84

answers:

3

If I have a large set of data that describes physical 'things', how could I go about measuring how well that data fits the 'things' that it is supposed to represent?

An example would be if I have a crate holding 12 widgets, and I know each widget weighs 1 lb, there should be some data quality 'check' making sure the full crate weighs about 13 lbs (12 one-pound widgets plus, say, a 1 lb crate).

Another example would be that if I have a lamp and an image representing that lamp, the image should look like a lamp. Perhaps the image dimensions should have the same ratio as the lamp's dimensions.

With the exception of images, my data is 99% text (which includes height, width, color...).

I've studied AI in school, but have done very little outside of that.

Are standard AI techniques the way to go? If so, how do I map a problem to an algorithm? Are some languages better suited to this than others? Do they have better libraries?

thanks.

A: 

This is a tough question to answer. For example, what defines a lamp? I could find pictures of some crazy-looking lamps on Google Images, or even look up the definition of a lamp (http://dictionary.reference.com/dic?q=lamp). There are no physical requirements for what a lamp must look like. That's the crux of the AI problem.

As for the data, you could set up unit testing on the project to ensure that 12 widget()s in the widgetBox() weigh less than 13 lbs. Regardless, you need to have the data at hand to be able to test things like that.
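
To make that concrete, here is a minimal sketch of such a check as a Python unit test; the crate_weight helper and the 1 lb crate tare are assumptions invented for the example:

    import unittest

    def crate_weight(widget_weights, crate_tare=1.0):
        """Expected total weight: the widgets plus the empty crate (assumed 1 lb)."""
        return sum(widget_weights) + crate_tare

    class TestCrateWeight(unittest.TestCase):
        def test_twelve_one_pound_widgets(self):
            # 12 widgets at 1 lb each, plus the crate, should come to 13 lbs.
            self.assertAlmostEqual(crate_weight([1.0] * 12), 13.0)

    if __name__ == "__main__":
        unittest.main()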

I hope I was able to answer your question somewhat. It's a bit vague, and my answers are broad, but hopefully it'll at least send you in a good direction.

cyberconte
+1  A: 

Your question is somewhat open-ended, but it sounds like what you want is what is known as a "classifier" in the field of machine learning.

In general, a classifier takes a piece of input and "classifies" it, i.e., determines a category for the object. Many classifiers provide a probability with this determination, and some may even return multiple categories with probabilities for each.

Some examples of classifiers are Bayes nets, neural nets, decision lists, and decision trees. Bayes nets are often used for spam classification: emails are classified as either "spam" or "not spam" with a probability.

For your question you'd want to classify your objects as "high quality" or "not high quality".

The first thing you'll need is a bunch of training data. That is, a set of objects where you already know the correct classification. One way to obtain this could be to get a bunch of objects and classify them by hand. If there are too many objects for one person to classify you could feed them to Mechanical Turk.

Once you have your training data you'd then build your classifier. You'll need to figure out what attributes are important to your classification. You'll probably need to do some experimentation to see what works well. You then have your classifier learn from your training data.
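
As a rough sketch of what that could look like in Python with scikit-learn (the attribute names, values, and labels are all invented for illustration; your own data would replace them):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import GaussianNB

    # Hand-classified training data: attribute dicts plus known-correct labels.
    objects = [
        {"height": 24.0, "width": 10.0, "color": "white"},
        {"height": 23.5, "width": 10.2, "color": "beige"},
        {"height": 2.0,  "width": 90.0, "color": "white"},  # implausible lamp
    ]
    labels = ["high quality", "high quality", "not high quality"]

    # Turn attribute dicts into numeric vectors; text values such as
    # "color" are one-hot encoded automatically.
    vectorizer = DictVectorizer(sparse=False)
    X = vectorizer.fit_transform(objects)

    # Have a simple probabilistic classifier learn from the training data.
    clf = GaussianNB()
    clf.fit(X, labels)

    # Classify a new object and inspect the per-class probabilities.
    new_obj = vectorizer.transform([{"height": 25.0, "width": 9.8, "color": "white"}])
    print(clf.predict(new_obj))
    print(clf.predict_proba(new_obj))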

One approach that's often used for testing is to split your training data into two sets. Train your classifier using one of the subsets, and then see how well it classifies the other (usually smaller) subset.
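
A sketch of that holdout approach, again assuming scikit-learn and using placeholder feature vectors and labels in place of real data:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.naive_bayes import GaussianNB

    # Placeholder labelled data standing in for your real feature vectors.
    rng = np.random.default_rng(0)
    X = rng.random((40, 3))
    y = rng.choice(["high quality", "not high quality"], size=40)

    # Hold out 25% of the labelled data for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    clf = GaussianNB()
    clf.fit(X_train, y_train)

    # Accuracy on the held-out subset estimates real-world performance.
    print(accuracy_score(y_test, clf.predict(X_test)))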

Laurence Gonsalves
+1  A: 

AI is one path, natural intelligence is another.

Your challenge is a perfect match for Amazon's Mechanical Turk. Divvy your data up into extremely small, verifiable atoms and assign them as HITs on Mechanical Turk. Include some overlap to give yourself a sense of how consistent the HIT answers are.
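
One way to sketch that divvying-up in Python (make_batches is a hypothetical helper; batch size and redundancy are parameters you'd tune):

    import random

    def make_batches(records, batch_size=5, redundancy=2, seed=0):
        """Assign every record to `redundancy` batches, shuffling each pass
        so duplicate copies land in different HITs for consistency checks."""
        rng = random.Random(seed)
        batches = []
        for _ in range(redundancy):
            shuffled = list(records)
            rng.shuffle(shuffled)
            batches += [shuffled[i:i + batch_size]
                        for i in range(0, len(shuffled), batch_size)]
        return batches

    batches = make_batches([f"item-{n}" for n in range(20)])
    # Answers for the same record from different workers can then be
    # compared to measure consistency.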

There was a shop with a boatload of component CAD drawings that needed to be grouped by similarity. They broke the job up and set it loose on Mechanical Turk, with very satisfying results. I could Google for hours and not find that link again.

See here for a related forum post.

JosefAssad