views:

138

answers:

9

I've got a set of lists of events. The events always happen in a given order, but not every event always happens. Here's an example input:

[[ do, re, fa, ti ],
 [ do, re, mi ],
 [ do, la, ti, za ],
 [ mi, fa ],
 [ re, so, za ]]

The input values don't have any inherent order. They're actually messages like "creating symlinks" and "reindexing search". They're sorted in the individual list, but there's no way to look at only 'fa' in the first list and 'mi' in the second and determine which comes before the other.

I'd like to be able to take that input and generate a sorted list of all events:

[ do, re, mi, fa, so, la, ti, za ]

or better yet, some information about each event, like a count:

[ [do, 3], [re, 3], [mi, 2],
  [fa, 2], [so, 1], [la, 1],
  [ti, 1], [za, 2] ]

Is there a name for what I'm doing? Are there accepted algorithms? I'm writing this in Perl, if that matters, but pseudocode will do.

I know that given my example input, I probably can't be guaranteed of the "right" order. But my real input has tons more datapoints, and I feel confident that with some cleverness it'll be 95% right (which is really all I need). I just don't want to re-invent the wheel if I don't have to.

A: 
perl -de 0
  DB<1> @a = ( ['a','b','c'], ['c','f'], ['h'] ) 
  DB<2> map { @m{@{$_}} = @$_ } @a
  DB<3> p keys %m
chabf

Quickiest shortcut I can think of. Either way, you have to iterate through things at least once...

eruciform
This doesn't work because *order* is the most important bit. Your output has "c" before "a" and "b", despite the fact that your first input list has "c" after both of them.
Plutor
yeah, this was a simplistic solution. but to be fair, the question said that they didn't care about ordering. :-)
eruciform
A: 

This is a perfect candidate for a Merge Sort. Go to the wikipedia page here for a pretty good representation of the algorithm http://en.wikipedia.org/wiki/Merge_sort

What you have described is actually a subset/small tweak of the merge sort. Instead of starting with an unsorted array, you have a set of sorted arrays that you want to merge together. Just call the "merge" function as described in the wikipedia page on pairs of your arrays and the results of the merge function until you have a single array (which will be sorted).

To tweak the output to the way you want, you'll need to define a comparison function that can return if one event is less than, equal to, or greater than a different event. Then, when your merge function finds two events that are equal, you can collapse them into a single event and keep a count for that event.

J Teller
The values aren't inherently sortable. I've changed them from A-H to "do", "re", "mi", etc, to make that more clear.
Plutor
Ah -- I get it. In your example, the arrays were sorted by the event -- you're saying that the events in the original arrays are sorted by a different key, and while that key has been "lost", you want to retain the original order of each part of the array, right?There's a way to do it fairly easily with the merge sort by re-writing the comparison function to return less than and greater than based on the original order of the inputs. I'll th ink about it after lunch :-)
J Teller
+3  A: 

Theoretically speaking, let me suggest the following algorithm:

  1. Build a directed graph.
  2. For each input [ X, Y, Z ], create the edges X->Y and Y->Z if they're not already there.
  3. Perform a topological sorting of the graph.
  4. Voila!

PS
This is only assuming that all events occur in a specific order (always!). If that's not the case, the problem becomes NP-Complete.

PPS
And just so that you have something useful: Sort::Topological (don't know if it actually works but it seems right)

adamk
+1 for first answer that actually works for unordered elements.
tzaman
A: 

Roughly, the name I would give it is "hashing". You are putting things into name value pairs. If you want to keep some semblance of order, you have to supplement the hash with an array that keeps order. That order is "encounter order" for me.

use strict;
use warnings;

my $all 
    = [[ 'do', 're', 'fa', 'ti' ],
       [ 'do', 're', 'mi' ],
       [ 'do', 'la', 'ti', 'za' ],
       [ 'mi', 'fa' ],
       [ 're', 'so', 'za' ]
     ];

my ( @order, %counts );

foreach my $list ( @$all ) { 
    foreach my $item ( @$list ) { 
        my $ref = \$counts{$item}; # autovivs to an *assignable* scalar.
        push @order, $item unless $$ref;
        $$ref++;
    }
}

foreach my $key ( @order ) { 
    print "$key: $counts{$key}\n";
}

# do: 3
# re: 3
# fa: 2
# ti: 2
# mi: 2
# la: 1
# za: 2
# so: 1

There are other answers like this one, but mine contains this neat autovivification trick.

Axeman
+2  A: 

If you're not into writing to much code, you could use the unix command-line utility tsort:

$ tsort -
do re
re fa
fa ti
do re
re mi
do la
la ti
ti za
mi fa
re so
so za

Which is a list of all pairs in your sample input. This produces as output:

do
la
re
so
mi
fa
ti
za

which is basically what you want.

adamk
+3  A: 

You can use tsort to infer a reasonable—although not necessarily unique—sort order (known as a topological order) from the ordering you've observed. You may be interested in reading tsort's original use, which is similar in structure to your problem.

Note that tsort requires an acyclic graph. In terms of your example, this means you couldn't see do followed by re in one sequence and re followed by do in another.

#! /usr/bin/perl

use warnings;
use strict;

use IPC::Open2;

sub tsort {
  my($events) = @_;

  my $pid = open2 my $out, my $in, "tsort";

  foreach my $group (@$events) {
    foreach my $i (0 .. $#$group - 1) {
      print $in map "@$group[$i,$_]\n", $i+1 .. $#$group;
    }
  }

  close $in or warn "$0: close: $!";

  chomp(my @order = <$out>);
  my %order = map +(shift @order => $_), 0 .. $#order;
  wantarray ? %order : \%order;
}

Because you described the data as sparse, the code above provides tsort with as much information as possible about the events' adjacency matrix.

Having that information, computing a histogram and sorting its components is straightforward:

my $events = [ ... ];

my %order = tsort $events;

my %seen;
do { ++$seen{$_} for @$_ } for @$events;

my @counts;
foreach my $event (sort { $order{$a} <=> $order{$b} } keys %seen) {
  push @counts => [ $event, $seen{$event} ];
  print "[ $counts[-1][0], $counts[-1][1] ]\n";
}

For the input in your question you provided, the output is

[ do, 3 ]
[ la, 1 ]
[ re, 3 ]
[ so, 1 ]
[ mi, 2 ]
[ fa, 2 ]
[ ti, 2 ]
[ za, 2 ]

This looks funny because we know the order of solfège, but re and la are incomparable in the partial order defined by $events: we know only that they must both come after do.

Greg Bacon
This is the best solution because it doesn't require me to implement the actual algorithm.
Plutor
A: 

I'm not really sure what this would be called either, but I figured out a way to find the order given the array of arrays as an input. Essentially the pseudo-code is:

10 Find earliest item in all arrays
20 Push that onto a list
30 Remove that item from all arrays
40 Goto 10 if there are any items left

Here's a working prototype:

#!/usr/bin/perl

use strict;

sub InList {
    my ($x, @list) = @_;
    for (@list) {
        return 1 if $x eq $_;
    }
    return 0;
}

sub Earliest {
    my @lists = @_;
    my $earliest;
    for (@lists) {
        if (@$_) {
            if (!$earliest
                || ($_->[0] ne $earliest && InList($earliest, @$_))) {

                $earliest = $_->[0];
            }
        }
    }
    return $earliest;
}

sub Remove {
    my ($x, @lists) = @_;

    for (@lists) {
        my $n = 0;
        while ($n < @$_) {
            if ($_->[$n] eq $x) {
                splice(@$_,$n,1);
            }
            else {
                $n++
            }
        }
    }
}

my $list = [
    [ 'do', 're', 'fa', 'ti' ],
    [ 'do', 're', 'mi' ],
    [ 'do', 'la', 'ti', 'za' ],
    [ 'mi', 'fa' ],
    [ 're', 'so', 'za' ]
];

my @items;

while (my $earliest = Earliest(@$list)) {
    push @items, $earliest;
    Remove($earliest, @$list);
}

print join(',', @items);

Output:

do,re,mi,fa,la,ti,so,za

Bizangles
+1  A: 

Solution:

This solves the original question before it was modified by the asker.


#!/usr/local/bin/perl -w
use strict; 

   main();

   sub main{
      # Changed your 3-dimensional array to a 2-dimensional array
      my @old = (
                   [ 'do', 're', 'fa', 'ti' ],
                   [ 'do', 're', 'mi' ],
                   [ 'do', 'la', 'ti', 'za' ],
                   [ 'mi', 'fa' ],
                   [ 're', 'so', 'za' ]
                );
      my %new;

      foreach my $row (0.. $#old ){                           # loop through each record (row)
         foreach my $col (0..$#{$old[$row]} ){                # loop through each element (col)                    
            $new{ ${$old[$row]}[$col] }{count}++;
            push @{ $new{${$old[$row]}[$col]}{position} } , [$row,$col];
         }
      }

      foreach my $key (sort keys %new){
         print "$key : $new{$key} " , "\n";                   # notice each value is a hash that we use for properties 
      }      
   } 

How to Retrieve Info:

   local $" = ', ';                       # pretty print ($") of array in quotes
   print $new{za}{count} , "\n";          # 2    - how many there were
   print "@{$new{za}{position}[1]} \n";   # 4,2  - position of the second occurrence
                                          #        remember it starts at 0   

Basically, we create a unique list of elements in the hash. For each of those elements we have a "property" hash, that contains a scalar count and an array for the position. The number of elements in the array should vary, based on how many occurrences of the element were in the original.

The scalar property isn't really necessary since you could always take the scalar of the position array to retrieve the same number. Note: if you ever add/remove elements from the array count and position will not be correlate in their meaning.

  • example: print scalar @{$new{za}{position}}; will give you the same as print $new{za}{count};
vol7ron
This would look a little cleaner if you just had a 2-dimensional array, rather than an unnecessary 3-dimensional array.
vol7ron
Thanks for the example on how to get the data! I think the hash of a hash is what he'd be interested in
Armando
A: 

Just realized your question said their is no predetermined order, so this may not be relevent.

Perl code:

$list = [
    ['do', 're', 'fa', 'ti' ],
    ['do', 're', 'mi' ],
    ['do', 'la', 'ti', 'za' ],
    ['mi', 'fa' ],
    ['re', 'so', 'za' ]
];
%sid = map{($_,$n++)}qw/do re mi fa so la ti za/;

map{map{$k{$_}++}@$_}@$list;
push @$result,[$_,$k{$_}] for sort{$sid{$a}<=>$sid{$b}}keys%k;

print "[@$_]\n" for(@$result);

output:

[do 3]
[re 3]
[mi 2]
[fa 2]
[so 1]
[la 1]
[ti 2]
[za 2]
pdehaan
this only works if you know and list the original order of all possible events.
vol7ron