views:

194

answers:

4

Okay, here's the situation: We have a table of about 50 columns (created by joining database tables) and several thousand rows. We need to identify a pattern in several known faulty records of that data. Here's a really boiled down example. Given a table:

-----------------------
| id | title | date   |
-----------------------
| 01 | c     | 2009-01|
| 02 | a     | 2009-02|
| 03 | a     | 2009-02|
| 04 | b     | 2009-03| 
| 05 | b     | 2009-03| 
| 06 | a     | 2009-04| 
-----------------------

And I ask the library to tell me how are rows 1, 4 and 5 related? Or, how are they different from all other rows? The library would say:

  • All selected rows have an odd month number
  • All selected rows do not have the title = 'a'

Perhaps the library is iterating through a series of pivot table groupings in excel. Whenever it finds a combinations of groupings and calculations that are interesting, it tells you.

The actual situation (for the curious only): The exact situation is that we found out changes to the data have been 'undone' somehow. Instead of just 'redoing' the changes and hoping they stick, we're trying to figure out why they occured, so that they don't unstick. Here are some of the real columns and possible data patterns:

-----------------------------------------------------
| id | user  | created_on| facility | review_status |
-----------------------------------------------------
| 01 | tom   | 2009-01   | Bay      | Locked        |  
| 02 | berry | 2009-02   | Inner    |               |
| 03 | jan   | 2009-02   | Hamming  | Submited      |
| 04 | bernie| 2009-03   | Youth    | Accepted      |
| 05 | jack  | 2009-03   | Johnson  | Locked        |
| 06 | frank | 2009-04   | Baber St.|               |
-----------------------------------------------------

Our problem is that all of the review statuses (column 5) should have been marked as 'locked,' but weren't.

Anyone know of a pattern-finding library for this kind of stuff? The long answer below hit the nail on the head, DATA MINING software seems to be right on the money, but the solution must be an Open Source OR "free as in beer" solution. Thanks Everyone!

P.S. Petitio principii answers, or answers that make no attempt to answer the initial question will not be considered (actually, they're considered, just not in the way one would expect).

+3  A: 

It sounds like you want to perform some kind of data mining. Potentially you could pass your data to a classification algorithm that could try and cluster it. However you may find yourself taking longer trying to get this to work than it would take you to just go through the data :)

If you've got a bit of spare time one weekend take a look at Pentaho Weka and see what it can do

True data mining software doesn't just change the presentation, but actually discovers previously unknown relationships among the data.

Weka

http://www.cs.waikato.ac.nz/~ml/weka/index.html

There are some nice introductory slides on Data Mining here http://www-users.cs.umn.edu/~kumar/dmbook/index.php

As used by Pentaho http://community.pentaho.com/faq/data%5Fmining.php

pjp
Thanks for the response, but it has to be an open source OR "free as in beer" solution. Forgot to put that in the description.
btelles
@btelles Weka is open source software issued under the GNU General Public License
pjp
DUDE!!! I completely skipped weka in your original answer! Sorry! Is there any way to change my preferred answer?
btelles
Hey ho it happens. It'll be interesting to see how you get on with using it :)
pjp
A: 

If you're using MySQL, running procedure_analyse() with appropriately narrow where clauses may give you some insight.

Autocracy
A: 

"We need to identify a pattern in several known faulty records of that data."

I will not address the issue that your choice of words seems to suggest that the data itself is not wrong, but only the RECORDS of the data.

I just want to point out that "preventing" (faulty records of data) is usually regarded as being cheaper than "curing" the same thing.

"Data faults" are a consequence of having overlooked something in the initial analysis/implementation. Always. So if you want to find "patterns" in the data faults that you have, re-think the initial analysis/implementation and try to figure out which mistakes were made in them.

A: 

Found exactly what we need: Weka Machine Learning Library.

http://www.cs.waikato.ac.nz/ml/weka/

This is a data mining library that is strongly backed by Pentaho. Check out Weka's "related projects" section for even more Open Source goodness.

Lucky for my lazy butt, there's also a Ruby binding library on Rubyforge, Rarff.

btelles
@btelles: if this is the accepted answer, I think a upvoting is necessary.
lmsasu