I have a bunch of regression test data. Each test is just a list of messages (associative arrays), mapping message field names to values. There's a lot of repetition within this data.

For example

   test1 = [
      { sender => 'client',  msg => '123',  arg => '900', foo => 'bar', ... },
      { sender => 'server',  msg => '456',  arg => '800', foo => 'bar', ... },
      { sender => 'client',  msg => '789',  arg => '900', foo => 'bar', ... },
   ]

I would like to represent the field data (as a minimal-depth decision tree?) so that each message can be programmatically regenerated using a minimal number of parameters. For example, in the data above

  • foo is always 'bar', so I don't need to mention it
  • sender and arg are correlated, so I only need to mention one or the other
  • and msg is different each time

So I would like to be able to regenerate these messages with a program along the lines of

write_msg( 'client', '123' )
write_msg( 'server', '456' )
write_msg( 'client', '789' )

where the write_msg function would be composed of nested if statements or subfunction calls using the parameters.
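
For instance, a minimal sketch of what such a function could look like (Python here, purely illustrative; the field values are taken from the sample data above):

    # Purely illustrative sketch: write_msg supplies the constant and
    # correlated fields itself, so only the varying parameters are passed in.
    def write_msg(sender, msg):
        # In the sample data, 'arg' is determined by 'sender' and 'foo' is
        # constant, so neither needs to be an argument.
        arg = '900' if sender == 'client' else '800'
        return {'sender': sender, 'msg': msg, 'arg': arg, 'foo': 'bar'}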

Based on my original data, how can I determine the 'most important' set of parameters, i.e. the ones that will let me recreate my data set using the smallest number of arguments?

+1  A: 

This looks very similar to Database Normalization.

You have a relation (your test data set) and some known functional dependencies ({sender} => arg, {} => foo, and possibly {msg} => sender; if the order of tests is important, add {testNr} => msg), and you want to eliminate the redundancy.

Treat your test set as a database table, apply the normalization rules and create equivalent functions (getArgFromSender(sender) etc.) for each join.
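
A rough sketch of that idea (Python; the dictionary-based lookups are my own illustration, using the sample data from the question):

    # Each functional dependency becomes a lookup built once from the
    # original rows, then exposed as a small accessor function.
    test1 = [
        {'sender': 'client', 'msg': '123', 'arg': '900', 'foo': 'bar'},
        {'sender': 'server', 'msg': '456', 'arg': '800', 'foo': 'bar'},
        {'sender': 'client', 'msg': '789', 'arg': '900', 'foo': 'bar'},
    ]

    # {sender} => arg
    arg_from_sender = {row['sender']: row['arg'] for row in test1}

    def getArgFromSender(sender):
        return arg_from_sender[sender]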

finnw
+1  A: 

If the number of fields and records is small:

Brute force it by looping through every combination of fields, and for each combination detect if there are multiple items in the list which map to the same value.
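
For example, a brute-force sketch along those lines (Python; treating a field combination as usable when any two rows that agree on it also agree on every other field is my own reading of the above):

    from itertools import combinations

    def smallest_determining_fields(rows):
        # Return the smallest field combination that determines all other fields.
        fields = sorted(rows[0].keys())
        for size in range(1, len(fields) + 1):
            for combo in combinations(fields, size):
                seen = {}
                ok = True
                for row in rows:
                    key = tuple(row[f] for f in combo)
                    if key in seen and seen[key] != row:
                        ok = False   # same key, different full row
                        break
                    seen[key] = row
                if ok:
                    return combo
        return tuple(fields)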

If you can live with a fairly good choice of fields:

Start off assuming you need all fields. Then, select a field at random and see if it can be eliminated; if it can, cross it off the set of fields. Otherwise, choose another field at random and try again. If you find no fields can be eliminated, then you've found a reasonable set of fields. Had you chosen other fields first, you might have found a better solution, so you can repeat the whole procedure a few times and pick the best result. This kind of approach is called hill climbing.
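
A minimal hill-climbing sketch of that procedure (Python; the helper names and the redundancy test are assumptions on my part):

    import random

    def determines_all(rows, fields):
        # The field set is sufficient if rows agreeing on it agree on everything.
        seen = {}
        for row in rows:
            key = tuple(row[f] for f in fields)
            if key in seen and seen[key] != row:
                return False
            seen[key] = row
        return True

    def hill_climb(rows):
        fields = list(rows[0].keys())
        order = fields[:]
        random.shuffle(order)
        for field in order:
            trimmed = [f for f in fields if f != field]
            if trimmed and determines_all(rows, trimmed):
                fields = trimmed   # the field was redundant; drop it
        return fields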

(I suspect that this problem is NP-complete, i.e. we probably don't know of an efficient exact solution, so it is not worth losing sleep over trying to dream up a perfect one.)

Dickon Reed
+3  A: 

The following papers describe algorithms for discovering functional dependencies:

Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111, 1999. doi:10.1093/comjnl/42.2.100.

I. Savnik and P. A. Flach. Bottom-up induction of functional dependencies from relations. In Proc. AAAI-93 Workshop: Knowledge Discovery in Databases, pages 174–185, Washington, DC, USA, 1993.

C. Wyss, C. Giannella, and E. Robertson. FastFDs: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances. In Proc. Data Warehousing and Knowledge Discovery, pages 101–110, Munich, Germany, 2001. doi:10.1007/3-540-44801-2.

H. Yao and H. J. Hamilton. Mining functional dependencies from data. Data Mining and Knowledge Discovery, 2008. doi:10.1007/s10618-007-0083-9.

There has also been some work on discovering multivalued dependencies:

I. Savnik and P. A. Flach. Discovery of multivalued dependencies from relations. Intelligent Data Analysis, 4(3):195–211, IOS Press, 2000.

Vebjorn Ljosa
Hmm. I'm glad to see this problem really is hard, and I haven't been flailing around for no reason. Thanks for collecting those references.
Eric