I have a list with the following content:

VIP  NAME   DATE   ARRIVE_TIME  FLIGHT_TIME
1    USER1  11-02  20.00        21.00
3    USER2  11-02  20.45        21.45
4    USER2  11-03  20.00        21.30
2    USER1  11-04  17.20        19.10

I want to sort this list, and similar ones, with a shell script. The result should be a new list containing only lines that do not collide. VIP 1 is the most important: if a VIP with a higher number has an ARRIVE_TIME before VIP 1's FLIGHT_TIME on the same date, that line should be removed. In other words, when DATE, ARRIVE_TIME and FLIGHT_TIME collide, the VIP number decides which line to keep. Similarly, VIP 2 is more important than VIP 3, and so on.

This is pretty advanced, and I am completely out of ideas on how to solve it.

+2  A: 

You can use the Unix sort command to do this.

There's an example of how to set primary and secondary keys, etc.:

Example

The uniq command is what you need to remove dupes.
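
As a rough sketch of what the key options look like on the question's data (the temporary file and the shuffled row order are assumptions made for illustration, and the header line is left out): -k3,3 sorts by DATE first, -k4,4n sorts numerically by ARRIVE_TIME within a date, and -k1,1n uses the VIP number as a tie-breaker.

```shell
#!/bin/bash
# Rows from the question, shuffled and without the header line.
# The temporary file is a stand-in for the real list.
flights_file=$(mktemp)
cat > "$flights_file" <<'EOF'
2  USER1 11-04    17.20    19.10
3  USER2 11-02    20.45    21.45
4  USER2 11-03    20.00    21.30
1  USER1 11-02    20.00    21.00
EOF

# Primary key: DATE (field 3). Secondary key: ARRIVE_TIME (field 4, numeric).
# Final tie-breaker: VIP (field 1, numeric).
sorted=$(sort -k3,3 -k4,4n -k1,1n "$flights_file")
printf '%s\n' "$sorted"

rm -f "$flights_file"
```

Note that uniq only drops adjacent lines that are identical, so sorting followed by uniq would leave all four of these rows in place; the overlap check itself still has to be done separately.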

Benj
This misses the point in that the situation for identifying a duplicate is extremely non-trivial. The question is badly framed (possibly because of unrealistic data and/or criteria), but the 'duplicates' are not simple duplicate lines; there is a different VIP number, for example, but the second is less important than the first and therefore gets bumped.
Jonathan Leffler
I don't know how to frame the question any better, but I think it is reasonably clear what I want to do.
J. Smith
+1  A: 

This might get you started:

  • I'm ignoring the header line. You can get rid of it using tail -n +2 or skip it in the for loop.
  • Sort the flights by date, arrival, departure and VIP number. Having the VIP number as a sort key simplifies the logic later.
  • I'm saving the result in an array, but you could redirect it to a temporary file and read it in a line at a time with a while read line; do ...; done <tempfile loop.
  • I'm using indirection to make things more readable (naming the fields instead of using array indices directly; the exclamation point means indirection here instead of "not").
  • For each line in the result that occurs on the same date as the most recently printed line, compare its arrival time to the previous flight's departure time.
  • Echo the lines that are appropriate.
  • Save the date and departure time for later comparison.
  • If an arrival exactly at the previous departure time should also count as a collision, adjust the < comparison. Bash's [[ ]] has no <= operator for strings, so negate > instead: ! ${!arrive} > $prevdep.
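
The temporary-file variant mentioned in the list could be sketched roughly as follows. This is only an illustration under assumptions: the input file is created with mktemp from the question's sample rows, the header is dropped with tail -n +2, and the kept array is a hypothetical addition that records which VIP numbers survive. Because read splits each line into named variables, no indirection is needed in this form.

```shell
#!/bin/bash
# Hypothetical input file holding the question's list, header included.
input=$(mktemp)
cat > "$input" <<'EOF'
VIP NAME DATE  ARRIVE_TIME FLIGHT_TIME
1  USER1 11-02    20.00    21.00
3  USER2 11-02    20.45    21.45
4  USER2 11-03    20.00    21.30
2  USER1 11-04    17.20    19.10
EOF

# Drop the header, then sort by date, arrival, departure and VIP number.
sorted=$(mktemp)
tail -n +2 "$input" | sort -k3,3 -k4,4n -k5,5n -k1,1n > "$sorted"

kept=()        # VIP numbers of the lines that are kept (illustrative only)
prevdate=""
prevdep=""
while read -r vip name date arrive depart
do
    # Same date as the last kept flight and arriving before it departs?
    if [[ $date == "$prevdate" && $arrive < $prevdep ]]
    then
        echo "deleted: $vip $name $date $arrive $depart"
    else
        echo "$vip $name $date $arrive $depart"
        kept+=("$vip")
        prevdate=$date
        prevdep=$depart
    fi
done < "$sorted"

rm -f "$input" "$sorted"
```

On the sample data this keeps VIPs 1, 4 and 2 and reports VIP 3 as deleted. The < test is a string comparison, so it relies on the times always being written with the same width (for example 07.05 rather than 7.05).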

Here is the script:

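#!/bin/bash
# Read the sorted file into an array, one line per element.
saveIFS="$IFS"
IFS=$'\n'
# Sort by date, arrival, departure, then VIP number.
flights=($(sort -k3,3 -k4,4n -k5,5n -k1,1n flights ))
IFS="$saveIFS"

# Names for indirect expansion: ${!date} expands to ${fields[2]}, and so on.
date=fields[2]
arrive=fields[3]
depart=fields[4]

for line in "${flights[@]}"
do
    fields=($line)    # split the line into fields on whitespace
    # Same date as the last kept flight and arriving before it departs?
    if [[ ${!date} == "$prevdate" && ${!arrive} < $prevdep ]]
    then
        echo "deleted: $line"    # or you could do something else here
    else
        echo "$line"
        prevdep=${!depart}
        prevdate=${!date}
    fi
done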
Dennis Williamson
This seems interesting; I need to take a closer look at it.
J. Smith