views:

41

answers:

2

This is a problem I've just run into, or rather its a simplification that captures the core problem.

Imagine I have a spreadsheet containing a number of columns, each of them labeled, and a number of rows.

I want to determine when the value in one column can be inferred from the value in another. For example, we might find that every time a '1' appears in column a, a '5' always appears in column d, but whenever a '2' appears in column a, a 3 always appears in column d. We observe that the value in column a reliably predicts the value in column c.

The goal is to identify all such relationships between columns.

The naive solution is to start with a list of all pairs of columns, (a, b), (a, c), (a, d)... (b, c), (b, d)... and so on. We call these the "eligible" list.

For each of these pairs, we keep track of the value of the first in the pair, and the corresponding value in the second. If we notice that we see the same value for the first of a pair, but a different value for the second of a pair, then this pair is no-longer eligible.

Whatever is left at the end of this process is the set of valid relationships.

Unfortunately, this rapidly becomes impractical as the number of columns increases, as the amount of data we must store is in the order of the number of columns squared.

Can anyone think of an efficient way to do this?

A: 

I don't think you can improve on O(n^2) for n columns: consider the case where no relationship exists between any pair. The only way to discover this is to test all pairs, which is O(n^2).

Rafe
A: 

I suspect you might be best to build up the relation, rather than whittle it down.

You might well have to store n^2 pieces of information, where you have n columns. For example if a column never repeats (ie its value is different on each row) then that column predicts all others. If every column is like that then every column predicts every other. You could use a two dimensional table pred say, indexed by columns numbers, with pred(a,b) true if a predicts b. pred(a,b) could have any of 3 values: true, false and unknown.

The predicts relation is transitive, that is if a predicts b and b predicts c then a predicts c. If the number of rows is large, so that checking if a row predicts another is expensive, then it might be worth using transitivity to fill out what you can: if you have just computed that pred(a,b) is true and you have already computed pred(b,x) for every x, then you can set pred(a,y) true for every y for which pred(b,y) is true.

To fill out pred(a,.) you could build a temporary array of pairs (value,row-index) from a, and then sort by value; this gives you easy access to the sets of indices where a is constant. If each of these sets is a singleton, then pred(a,b) is true for every b; otherwise to check if a predicts b (if its not already known) you need to check that b is constant on each index set (with more than one member) where a is constant.

An optimisation might be that if pred(a,b) is true, and also pred(b,a) is true then for every c, pred(a,c) if and only if pred(b,c); thus in this case if you have already filled out pred(b,.) you can fill out all pred(a,.) by copying.

dmuir
You raise a good point that according to my definition, a column where every column is unique then it can be said to predict all other columns, however this is not really what I intended, as it would have little actual predictive value. I guess I need to refine my definition of the problem.
sanity