tags:

views: 167

answers: 3

I have a table set up as follows:

id
origin
destination
carrier_id

so a typical row could be:

100: London    Manchester  366

Now each route goes both ways, so there shouldn't be a row like this:

233: Manchester    London    366

since that's essentially the same route (for my purposes, anyway).

Unfortunately, though, I have wound up with a handful of duplicates. I have over 50,000 routes made up of around 2,000 points of origin (or destination, however you want to look at it) in the table, so I'm thinking looping through each point of origin to find duplicates would be insane.

So I don't even know where to start trying to figure out a query to identify them. Any ideas?

A: 

Bummer! Off the top of my head (and in pseudo-SQL):

select id, route_key from (
  select id, concat(origin, '_', destination, '_', carrier_id) as route_key from ....
  union all
  select id, concat(destination, '_', origin, '_', carrier_id) as route_key from ....
) t
group by route_key
having count(*) > 1;

For the two records above, the inner union would produce:

100, London_Manchester_366
100, Manchester_London_366
233, Manchester_London_366
233, London_Manchester_366

That's really, really hackish, and doesn't give you exactly what you're after - it only narrows it down. Maybe it'll give you a starting point? Maybe it'll give someone else some ideas they can use to help you, too.
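For what it's worth, the trick can be made concrete. Here's a runnable sketch of the same idea using SQLite through Python; it normalizes each row to a single direction-independent key (rather than unioning both directions), and the `routes` table name is an assumption taken from the question:

```python
import sqlite3

# Build a toy version of the table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE routes (id INTEGER, origin TEXT, destination TEXT, carrier_id INTEGER)"
)
conn.executemany("INSERT INTO routes VALUES (?, ?, ?, ?)", [
    (100, "London", "Manchester", 366),
    (233, "Manchester", "London", 366),  # reverse duplicate of 100
    (121, "London", "CityA", 240),       # no reverse row, not a duplicate
])

# Sort origin/destination inside the key so both directions collapse to
# one group. SQLite's min()/max() accept two scalars; in MySQL you'd use
# LEAST()/GREATEST() instead.
rows = conn.execute("""
    SELECT min(origin, destination) || '_' || max(origin, destination)
               || '_' || carrier_id AS route_key,
           group_concat(id) AS ids
    FROM routes
    GROUP BY route_key
    HAVING count(*) > 1
""").fetchall()

for route_key, ids in rows:
    print(route_key, "->", sorted(ids.split(",")))
# London_Manchester_366 -> ['100', '233']
```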

Todd R
A: 

If you don't mind a little shell scripting, and if you can get a dump of the table in the form you've shown here, you can try something like the following. Here's my sample input:

100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
233: Manchester London 366

You might be able to do something like this:

$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | sort
CityA CityB 150:
CityA London 121:
CityA Manchester 144:
London Manchester 100:
London Manchester 233:

So that you at least have the pairs grouped together. Not sure what would be the best move from there.


Okay, here's a beast of a command line:

$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | (sort; echo "") | awk '{ if (fst == $1 && snd == $2) { printf "%s%s", num, $3 } else { print fst, snd; fst = $1; snd = $2; num = $3} }' | grep "^[0-9]"
150:151:150:255:CityA CityB
100:233:London Manchester

where m.txt has these new contents:

100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
151: CityB CityA 90
233: Manchester London 366
255: CityA CityB 90

Perl probably would have been a better choice than awk, but here goes. First we sort the two city names within each line and move the ID to the end of the string, which the first awk section does. Then we sort those lines to group pairs together, and tack an extra line onto the end so the awk script can finish up. Next we loop over each line: if we see a new pair of cities, we print the pair we previously saw and store the new cities and the new ID; if we see the same cities as on the last line, we print the previous line's ID followed by this line's ID. Finally, we grep only lines beginning with a number, discarding the non-duplicated pairs.

If a pair occurs more than twice, you'll get a duplicate ID, but that's not such a big deal.

Clear as mud?
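Since the answer wonders whether Perl would have been a better fit, here's roughly the same pairing logic in straight Python, assuming the same dump format as `m.txt` above (like the awk version, it ignores the carrier column when pairing):

```python
from collections import defaultdict

# Same contents as the m.txt sample above.
dump = """\
100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
151: CityB CityA 90
233: Manchester London 366
255: CityA CityB 90
"""

# Group ids by the sorted (origin, destination) pair, mirroring the
# awk | sort pipeline; any group with more than one id is a duplicate set.
groups = defaultdict(list)
for line in dump.splitlines():
    row_id, a, b, _carrier = line.split()
    groups[tuple(sorted((a, b)))].append(row_id.rstrip(":"))

for pair, ids in groups.items():
    if len(ids) > 1:
        print(" ".join(pair), "->", ", ".join(ids))
# London Manchester -> 100, 233
# CityA CityB -> 150, 151, 255
```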

Mark Rushakoff
+3  A: 

I think you just need a self-join; the following will identify all the "duplicate" records, joined together.

Here's an example.

Say SELECT * FROM FLIGHTS yielded:

id  origin     destination  carrierid
1   toronto    quebec       1
2   quebec     toronto      2
3   edmonton   calgary      3
4   calgary    edmonton     4
5   hull       vancouver    5
6   vancouver  edmonton     6
7   edmonton   toronto      7
9   edmonton   quebec       8
10  toronto    edmonton     9
11  quebec     edmonton     10
12  calgary    lethbridge   11

So there's a bunch of duplicates (4 of the routes are duplicates of some other route).

select  *
from    flights t1 inner join flights t2 on t1.origin = t2.destination 
     AND t2.origin = t1.destination

would yield just the duplicates:

id  origin    destination  carrierid  id  origin    destination  carrierid
1   toronto   quebec       1          2   quebec    toronto      2
2   quebec    toronto      2          1   toronto   quebec       1
3   edmonton  calgary      3          4   calgary   edmonton     4
4   calgary   edmonton     4          3   edmonton  calgary      3
7   edmonton  toronto      7          10  toronto   edmonton     9
9   edmonton  quebec       8          11  quebec    edmonton     10
10  toronto   edmonton     9          7   edmonton  toronto      7
11  quebec    edmonton     10         9   edmonton  quebec       8

At that point you can delete one row from each pair - say, the one with the higher id (without that id comparison, the subquery returns both rows of each pair and you'd lose the route entirely):

delete from flights
where id in (
    select  t1.id
    from    flights t1 inner join flights t2 on t1.origin = t2.destination 
         AND t2.origin = t1.destination
    where   t1.id > t2.id
)

Good luck!

Tyler
I think you might also need "AND t1.carrierid = t2.carrierid" in your join statement. I suspect that it would only be a duplicate journey if both directions were provided by the same carrier. Hopefully, the OP can clarify the matching rules.
Convict
Well, after I added the carrier clause, I got exactly what I needed. But I can't get the delete working. I get the message: "You can't specify target table 'routes' for update in FROM clause", which from what I've read elsewhere means wrapping the subquery in another subquery... somehow. My ham-fisted attempts
gargantaun
.. made mysql physically sick.
gargantaun
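The workaround the comments are reaching for is to wrap the subquery in a derived table, which makes MySQL materialize it before the DELETE touches the table. Here's a sketch, run against SQLite via Python purely to make it runnable (the `routes` table name follows the comment; the carrier clause and an id comparison are included so only one row of each pair is removed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE routes (id INTEGER, origin TEXT, destination TEXT, carrier_id INTEGER)"
)
conn.executemany("INSERT INTO routes VALUES (?, ?, ?, ?)", [
    (100, "London", "Manchester", 366),
    (233, "Manchester", "London", 366),  # reverse duplicate of 100
    (121, "London", "CityA", 240),
])

# The derived table (aliased AS tmp) is what sidesteps MySQL's
# "can't specify target table for update in FROM clause" error;
# t1.id > t2.id keeps the lower-id row of each pair.
conn.execute("""
    DELETE FROM routes WHERE id IN (
        SELECT id FROM (
            SELECT t1.id
            FROM routes t1
            JOIN routes t2 ON t1.origin = t2.destination
                          AND t2.origin = t1.destination
                          AND t1.carrier_id = t2.carrier_id
            WHERE t1.id > t2.id
        ) AS tmp
    )
""")
print(conn.execute("SELECT id FROM routes ORDER BY id").fetchall())
# [(100,), (121,)]
```

The same DELETE statement, minus the SQLite setup, is the commonly cited workaround for that MySQL restriction, since the inner query is hidden behind the derived-table alias.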