tags:

views:

47

answers:

2

Hello:

I have a table with a very simple schema:

CREATE TABLE q(
 orig INTEGER NOT NULL,
 dest INTEGER NOT NULL,
 cost  FLOAT,
 PRIMARY KEY (orig, dest)
);

I need to walk that table backwards in a cost minimizing way. Let me explain.

I need a tour of 15 points, a point can be an orig or a dest. My algorithm is a backtracking from the last dest to the initial orig. So here is how I spell it out:

Given the final dest, find the orig which would link to said dest with minimum cost. The corresponding orig becomes the new dest; loop over this 15 times.

Let's assume that I know the last dest is number 10. The SQL statement that allows me to find the orig leading to a dest in a cost-minimizing way is:

SELECT orig FROM q WHERE cost = (SELECT MIN(cost) FROM q WHERE dest = 10);

Now I would use the orig returned by the above function to find the previous point (assume it returned, say, point 5):

SELECT orig FROM q WHERE cost = (SELECT MIN(cost) FROM q WHERE dest = 5);

I can go on like this until I have the 15 points.

How to make an efficient query to do this in SQL? My table has 50 million rows.

A: 

Here's a query that assumes you're using SQL Server 2005+. It uses common table expressions (CTE). This example actually returns all paths, with cumulative cost, that have a selected orig and dest. It also shows the number of points in the path. It could be altered to return the best choice (e.g, lowest cost, fewest steps, etc.)

I don't know how performance would be. But, indexes would be helpful.

WITH Paths AS  -- Get list of all paths
    ( SELECT ROW_NUMBER() OVER (ORDER BY orig) AS PathNumber, orig,
        dest, cost, 1 AS points
    FROM q

    UNION ALL

    SELECT Paths.PathNumber, Paths.orig,
        q.dest, q.cost, paths.points + 1 AS points
    FROM Paths
    JOIN q ON Paths.dest = q.orig
    WHERE Paths.points < 15
    )
    , PathsRows AS  -- Get total points in each path
    ( SELECT COUNT(*) OVER (PARTITION BY PathNumber) AS TotalPoints, PathNumber
        , orig, dest, cost, points
    FROM Paths
    )
    , PathsSum AS  -- summarize for each path
    ( SELECT PathNumber,
        MIN(CASE WHEN points = TotalPoints THEN orig END) AS orig,
        MAX(CASE WHEN points = TotalPoints THEN dest END) AS dest,
        SUM(cost) AS cost, MAX(points) AS points
    FROM PathsRows
    GROUP BY PathNumber
    )

SELECT PathNumber, orig, dest, cost, points
FROM PathsSum
WHERE dest = 4
    and orig = 1
ORDER BY PathNumber, points
bobs
From looking at the OP's question tags, I suspect that MySQL is the RDBMS in question - as far as I know, MySQL does not support recursive CTEs.
Mark Bannister
A: 

Short answer: you can't.

Longer answer: assuming I have understood your question correctly, you could write a SQL statement that was as efficient as possible, but it would be highly unlikely to return a result within a reasonable timeframe - say, the age of the universe so far.

Assuming your table of 50 million rows includes no duplicates, and maps the direct paths from all locations to all other locations, then the number of locations included is approximately the square root of 50 million - ie. somewhat over 7070 locations.

The number of possible paths that could be taken would therefore be 7070 x 7069 x 7068 ... x 7056, or to put it another way, somewhat larger than 7000^15 (approximately 4.75 x 10^57).

Ultimately, this is a variant of the Travelling Salesman Problem - SQL-based brute force approaches are uttely unsuited to resolving it, with a dataset this large.

Mark Bannister
Yes, I agree with your answer. However, notice that what I'm doing is a backtracking which does not enumerate the combinations - it is simply a feasibility study of routes (not even dynamic programming). I ended up going to a matrix-based approach where the problem I was trying to solve converged in about 10 minutes.
Arrieta
@Arrieta, can you post your solution, as a separate answer? I have to confess I can't visualise it.
Mark Bannister