I have a bunch of regression test data. Each test is just a list of messages (associative arrays), mapping message field names to values. There's a lot of repetition within this data.
For example
test1 = [
{ sender => 'client', msg => '123', arg => '900', foo => 'bar', ... },
{ sender => 'server', msg => '456', arg => '800', foo => 'bar', ... },
{ sender => 'client', msg => '789', arg => '900', foo => 'bar', ... },
]
I would like to represent the field data (as a minimal-depth decision tree?) so that each message can be programatically regenerated using a minimal number of parameters. For example, in the above
- foo is always 'bar', so I don't need to mention it
- sender and client are correlated, so I only need to mention one or the other
- and msg is different each time
So I would like to be able to regenerate these messages with a program along the lines of
write_msg( 'client', '123' )
write_msg( 'server', '456' )
write_msg( 'client', '789' )
where the write_msg function would be composed of nested if statements or subfunction calls using the parameters.
Based on my original data, how can I determine the 'most important' set of parameters, i.e. the ones that will let me recreate my data set using the smallest number of arguments?