
These questions regard a set of data with lists of tasks performed in succession and the total time required to complete them. I've been wondering whether it would be possible to determine useful things about the tasks' lengths, either as they are or with some initial guesstimation based on appropriate domain knowledge. I've come to think graph theory would be the way to approach this problem in the abstract, and I have a decent basic grasp of the stuff, but I can't tell for certain whether I'm on the right track. Furthermore, I think it's a pretty interesting question to crack. So here we go:

  1. Is it possible to determine the weights of edges in a directed weighted graph, given a list of walks in that graph with the lengths (summed weights) of said walks? I recognize the amount and quality of permutations on the routes taken by the walks will dictate the quality of any possible answer, but let's assume all possible walks and their lengths are given. If a definite answer isn't possible, what kind of things can be concluded about the graph? How would you arrive at those conclusions?

  2. What if there were several similar walks with possibly differing lengths given? Can you calculate a decent average (or other illustrative measure) for each edge, given enough permutations on different routes to take? How will discounting some permutations from the available data set affect the calculation's accuracy?

  3. Finally, what if you had a set of initial guesses as to the weights and had to refine those using the walks given? Would that improve upon your guesstimation ability, and how could you apply the extra information?

EDIT: Clarification on the difficulties of a plain linear algebraic approach. Consider the following set of walks:

a = 5
b = 4
b + c = 5
a + b + c = 8

A matrix equation with these values is unsolvable, but we'd still like to estimate the terms. There might be some helpful initial data available, as in scenario 3, and in any case we can apply knowledge of the real world - such as that the length of a task can't be negative. I'd like to know if you have ideas on how to ensure we get reasonable estimates, and that we also know what we don't know - e.g. when there's not enough data to tell a from b.
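For concreteness, here's a rough sketch of what I mean, treating the four walks above as an overdetermined system and taking the least-squares compromise with numpy (ignoring the nonnegativity constraint for now - something like scipy.optimize.nnls would add it):

```python
import numpy as np

# Each row marks which of the edges (a, b, c) a walk uses; the
# right-hand side holds the observed walk lengths. The system is
# inconsistent, so we look for the least-squares compromise rather
# than an exact solution.
A = np.array([
    [1, 0, 0],  # a         = 5
    [0, 1, 0],  # b         = 4
    [0, 1, 1],  # b + c     = 5
    [1, 1, 1],  # a + b + c = 8
], dtype=float)
y = np.array([5.0, 4.0, 5.0, 8.0])

x, residual, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print(dict(zip("abc", x)))  # a ~ 4.33, b = 4.0, c ~ 0.33
print(rank)                 # rank 3: all three edges are identifiable here
```

A rank lower than the number of edges would be exactly the "can't tell a from b" situation.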

+3  A: 

Seems like an application of linear algebra.

You have a set of linear equations which you need to solve, the variables being the lengths of the tasks (or edge weights).

For instance, suppose the task lengths were t1, t2, t3 for 3 tasks.

And you are given

t1 + t2 = 2  (task 1 and 2 take 2 hours)

t1 + t2 + t3 = 7 (all 3 tasks take 7 hours)

t2 + t3 = 6   (tasks 2 and 3 take 6 hours)

Solving gives t1 = 1, t2 = 1, t3 = 5.

You can use standard linear algebra techniques (e.g. Gaussian elimination: http://en.wikipedia.org/wiki/Gaussian_elimination) to solve these, which will tell you whether there is a unique solution, no solution, or infinitely many solutions (no other outcomes are possible).
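As a quick sketch (assuming Python/numpy is an option), the little 3-task system above solves directly, and comparing matrix ranks distinguishes the three cases:

```python
import numpy as np

# Coefficient matrix: one row per observed walk over tasks t1, t2, t3.
A = np.array([
    [1, 1, 0],  # t1 + t2      = 2
    [1, 1, 1],  # t1 + t2 + t3 = 7
    [0, 1, 1],  # t2 + t3      = 6
], dtype=float)
b = np.array([2.0, 7.0, 6.0])

# Full rank means exactly one solution; comparing the ranks of A and
# of the augmented matrix [A | b] separates the no-solution case
# (ranks differ) from the infinitely-many case (equal but deficient).
assert np.linalg.matrix_rank(A) == 3
t = np.linalg.solve(A, b)
print(t)  # [1. 1. 5.]
```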

If you find that the linear equations do not have a solution, you can try adding a very small random number to some of the task weights/coefficients of the matrix and solving again (I believe this falls under perturbation theory). Matrices are notorious for radically changing behavior with small changes in their values, so this will likely give you an approximate answer reasonably quickly.

Or maybe you can try introducing a 'slack' task in each walk (i.e. add more variables) and pick the solution to the new equations where the slack tasks satisfy some linear constraints (like 0 < s_i < 0.0001, minimizing the sum of the s_i), using linear programming techniques.
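A sketch of that slack idea with scipy's linprog, applied to the inconsistent system from the question's edit. The formulation here is my own variant: each walk gets a signed slack split as s_plus - s_minus, and we minimize the total slack (the L1 misfit), with all edge weights kept nonnegative:

```python
import numpy as np
from scipy.optimize import linprog

# Walks over edges (a, b, c) with observed lengths.
A = np.array([
    [1, 0, 0],  # a         = 5
    [0, 1, 0],  # b         = 4
    [0, 1, 1],  # b + c     = 5
    [1, 1, 1],  # a + b + c = 8
], dtype=float)
y = np.array([5.0, 4.0, 5.0, 8.0])
m, n = A.shape

# Variables: n edge weights, then m positive and m negative slacks.
# Constraint: A @ x + s_plus - s_minus = y, everything >= 0.
# Objective: minimize the total slack, i.e. the L1 misfit.
c = np.concatenate([np.zeros(n), np.ones(2 * m)])
A_eq = np.hstack([A, np.eye(m), -np.eye(m)])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))

weights = res.x[:n]
print(weights, res.fun)  # nonnegative edge estimates; total misfit is 2
```

The misfit can't go below 2 for this data, since the first, third, and fourth walks jointly contradict each other by exactly that amount.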

Moron
Do excuse the delay in commenting on this. Yes, you're correct that straightforward linear algebra is one approach, but that only covers the first, most naive scenario and leaves the beef - scenarios two and three - completely untended. This was actually the first solution I came up with, but a prompt realization of the downfalls of this approach left me at a dead end - one I wanted to avoid leading any other answers into. Seeing the lack of answers, though, I'll be upping the ante with a bounty shortly.
Ezku
@Ezku: Your question in #2 and #3 isn't really clear. For instance, it is not at all clear why a linear algebra method won't work there. What 'downfalls' are you talking about? Perhaps that will help clarify the question further. Perhaps some examples will help too.
Moron
@Ezku: Added a paragraph to the answer based on your edit of the question.
Moron
Added one more paragraph.
Moron
A: 

Assume each edge is represented by its own symbol (a, b, c, d, e, etc.), and each walk is given as the list of edges it traverses together with its total length.

Pick a walk and solve it for one of its edges: that edge equals the walk's length minus the sum of the walk's other edges. Substitute the resulting expression for that edge wherever it appears in the remaining walks, then repeat with the next walk until nothing simplifies further.

Example:

a + b + c + d + e = 50

a + c + b + e = 20

c + e = 10

Solving the third walk for c gives c = 10 - e. Substituting into the second walk: a + (10 - e) + b + e = 20, i.e. a + b = 10. Substituting both into the first walk: (a + b) + (c + e) + d = 50 becomes 10 + 10 + d = 50, so d = 30.

Repeat until no substitution changes anything, and you finish! Here d is fully determined, while the walks only pin down the sums a + b = 10 and c + e = 10.
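The same substitution can be done symbolically; a sketch with sympy (assuming it's available) on the three example walks shows at once what is determined and what isn't:

```python
import sympy as sp

a, b, c, d, e = sp.symbols("a b c d e")

# The three walks from the example, as equations on edge weights.
eqs = [
    sp.Eq(a + b + c + d + e, 50),
    sp.Eq(a + c + b + e, 20),
    sp.Eq(c + e, 10),
]

sol = sp.solve(eqs, [a, b, c, d, e], dict=True)[0]
print(sol)  # a = 10 - b, c = 10 - e, d = 30; b and e stay free
```

The free symbols left in the solution are exactly the "what we don't know" part of the question.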

TaslemGuy
A: 

I'd forget about graphs and treat the lists of tasks as vectors - every task represented as a component whose value equals its cost (time to complete, in this case).

If tasks appear in different orders initially, that's where to use domain knowledge to bring them into a canonical form, and to assign multipliers if domain knowledge tells you that the ratio of costs will be substantially influenced by ordering/timing. Timing is implicit in the initial ordering, but you may have to make a function of time just for the adjustment factors (say, driving at lunch time vs. driving at midnight). The function might be tabular/discrete. In general it's always much easier to evaluate ratios and relative biases (the hardness of doing something). You may want a functional language to do repeated rewrites of your vectors until there's nothing more that domain knowledge and rules can change.

With canonical vectors, consider just the presence or absence of each task (just 0|1 for this iteration) and look for minimal diffs - single-task diffs first - as those provide estimates involving a small number of variables. Keep doing this recursively, be ready to backtrack, and have a heuristic rule for the goodness or quality of the estimates so far. Keep track of good "rounds" that you backtracked from.

When you reach a minimal irreducible state - you can't make any more diffs, and all vectors have the same remaining tasks - you can do some basic statistics like variance, mean, and median, look for big outliers, and look for ways to improve the initial domain-knowledge-based estimates that led to the canonical form. If you find a lot of them and can infer new rules, take them in and start the whole process over.
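The single-task diff step could be sketched like this (plain Python, my own naming): whenever one walk's task set extends another's by exactly one task, the difference of their totals is one sample of that task's cost, and repeated samples give a mean and variance per task.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Walks as (set of tasks, observed total time).
walks = [
    ({"a"}, 5.0),
    ({"a", "b"}, 9.0),
    ({"a", "b"}, 9.4),
    ({"a", "b", "c"}, 10.0),
]

# For every pair of walks differing by exactly one task, record the
# difference of totals as a cost estimate for that extra task.
estimates = defaultdict(list)
for tasks1, total1 in walks:
    for tasks2, total2 in walks:
        extra = tasks2 - tasks1
        if len(extra) == 1 and tasks1 <= tasks2:
            estimates[extra.pop()].append(total2 - total1)

for task, samples in sorted(estimates.items()):
    print(task, mean(samples), pvariance(samples))
```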

Yes, this can cost a lot :-)

ZXX