tags:

views:

290

answers:

5
List-1    List-2
one       one
two       three
three     three
four      four
five      six
six       seven
eight     eighttt
nine      nine

Looking to output

one       | one        PASS
two       | *               FAIL MISSING
three     | three      PASS
*         | three           FAIL EXTRA
four      | four       PASS
five      | *               FAIL MISSING
six       | six        PASS
*         | seven           FAIL EXTRA
eight     | eighttt         FAIL INVALID
nine      | nine       PASS

Actually the return from my current solution is a reference to the two modified lists and a reference to a "fail" list describing the failure for the index as either "no fail", "missing", "extra", or "invalid" which is also (obviously) fine output.

My current solution is:

sub compare {
    local $thisfound = shift;
    local $thatfound = shift;
    local @thisorig = @{ $thisfound };
    local @thatorig = @{ $thatfound };
    local $best = 9999; 

    foreach $n (1..6) {
        local $diff = 0;
        local @thisfound = @thisorig;
        local @thatfound = @thatorig;
        local @fail = ();
        for (local $i=0;$i<scalar(@thisfound) || $i<scalar(@thatfound);$i++) {
            if($thisfound[$i] eq $thatfound[$i]) { 
                $fail[$i] = 'NO_FAIL';
                next;
            }
            if($n == 1) {      # 1 2 3
                next unless __compare_missing__();
                next unless __compare_extra__();
                next unless __compare_invalid__();
            } elsif($n == 2) { # 1 3 2
                next unless __compare_missing__();
                next unless __compare_invalid__();
                next unless __compare_extra__();
            } elsif($n == 3) { # 2 1 3
                next unless __compare_extra__();
                next unless __compare_missing__();
                next unless __compare_invalid__();
            } elsif($n == 4) { # 2 3 1
                next unless __compare_extra__();
                next unless __compare_invalid__();
                next unless __compare_missing__();
            } elsif($n == 5) { # 3 1 2
                next unless __compare_invalid__();
                next unless __compare_missing__();
                next unless __compare_extra__();
            } elsif($n == 6) { # 3 2 1
                next unless __compare_invalid__();
                next unless __compare_extra__();
                next unless __compare_missing__();
            }
            push @fail,'INVALID'; 
            $diff += 1;
        }
        if ($diff<$best) {
            $best = $diff;
            @thisbest = @thisfound;
            @thatbest = @thatfound;
            @failbest = @fail;
        }
    }
    return (\@thisbest,\@thatbest,\@failbest)
}

sub __compare_missing__ {
    my $j;
    ### Does that command match a later this command? ###
    ### If so most likely a MISSING command           ###
    for($j=$i+1;$j<scalar(@thisfound);$j++) {
        if($thisfound[$j] eq $thatfound[$i]) {
            $diff += $j-$i;
            for ($i..$j-1) { push(@fail,'MISSING'); }
            @end = @thatfound[$i..$#thatfound];
            @thatfound = @thatfound[0..$i-1];
            for ($i..$j-1) { push(@thatfound,'*'); }
            push(@thatfound,@end);
            $i=$j-1;
            last;
        }
    }
    $j == scalar(@thisfound);
}

sub __compare_extra__ {
    my $j;
    ### Does this command match a later that command? ###
    ### If so, most likely an EXTRA command           ###
    for($j=$i+1;$j<scalar(@thatfound);$j++) {
        if($thatfound[$j] eq $thisfound[$i]) { 
            $diff += $j-$i;
            for ($i..$j-1) { push(@fail,'EXTRA'); }
            @end = @thisfound[$i..$#thisfound];
            @thisfound = @thisfound[0..$i-1];
            for ($i..$j-1) { push (@thisfound,'*'); }
            push(@thisfound,@end);
            $i=$j-1;
            last; 
        }
    }
    $j == scalar(@thatfound);
}

sub __compare_invalid__ {
    my $j;
    ### Do later commands match?                      ###
    ### If so most likely an INVALID command          ###
    for($j=$i+1;$j<scalar(@thisfound);$j++) {
        if($thisfound[$j] eq $thatfound[$j]) { 
            $diff += $j-$i;
            for ($i..$j-1) { push(@fail,'INVALID'); }
            $i=$j-1;
            last;
        }
    }
    $j == scalar(@thisfound);
}

But this isn't perfect ... who wants to simplify and improve? Specifically ... within a single data set, one order of searching is better for a subset and another order is better for a different subset.

A: 

The trick in Perl (and similar languages) is the hash, which doesn't care about order.

Suppose that the first array is the one that hold the valid elements. Construct a hash with those values as keys:

  my @valid = qw( one two ... );
  my %valid = map { $_, 1 } @valid;

Now, to find the invalid elements, you just have to find the ones not in the the %valid hash:

  my @invalid = grep { ! exists $valid{$_} } @array;

If you want to know the array indices of the invalid elements:

  my @invalid_indices = grep { ! exists $valid{$_} } 0 .. $#array;

Now, you can expand that to find the repeated elements too. Not only do you check the %valid hash, but also keep track of what you have already seen:

 my %Seen;
 my @invalid_indices = grep { ! exists $valid{$_} && ! $Seen{$_}++ } 0 .. $#array;

The repeated valid elements are the ones with a value in %Seen that is greater than 1:

 my @repeated_valid = grep { $Seen{$_} > 1 } @valid;

To find the missing elements, you look in %Seen to check what isn't in there.

 my @missing = grep { ! $Seen{$_ } } @valid;
brian d foy
What about elements in list 2 that don't exist in list 1? I suppose I could make a hash %this and %that and cross reference using grep ( i wasn't aware there was a grep in perl ... whoopsie) so I'm probably looking at *simpler* though perhaps not the actually functionality I need....
Reed Debaets
Um, that's the @missing example. Maybe our edits crossed on the wire.
brian d foy
A: 

From perlfaq4's answer to How can I tell whether a certain element is contained in a list or array?:


(portions of this answer contributed by Anno Siegel and brian d foy)

Hearing the word "in" is an indication that you probably should have used a hash, not a list or array, to store your data. Hashes are designed to answer this question quickly and efficiently. Arrays aren't.

That being said, there are several ways to approach this. In Perl 5.10 and later, you can use the smart match operator to check that an item is contained in an array or a hash:

use 5.010;

if( $item ~~ @array )
    {
    say "The array contains $item"
    }

if( $item ~~ %hash )
    {
    say "The hash contains $item"
    }

With earlier versions of Perl, you have to do a bit more work. If you are going to make this query many times over arbitrary string values, the fastest way is probably to invert the original array and maintain a hash whose keys are the first array's values:

@blues = qw/azure cerulean teal turquoise lapis-lazuli/;
%is_blue = ();
for (@blues) { $is_blue{$_} = 1 }

Now you can check whether $is_blue{$some_color}. It might have been a good idea to keep the blues all in a hash in the first place.

If the values are all small integers, you could use a simple indexed array. This kind of an array will take up less space:

@primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);
@is_tiny_prime = ();
for (@primes) { $is_tiny_prime[$_] = 1 }
# or simply  @istiny_prime[@primes] = (1) x @primes;

Now you check whether $is_tiny_prime[$some_number].

If the values in question are integers instead of strings, you can save quite a lot of space by using bit strings instead:

@articles = ( 1..10, 150..2000, 2017 );
undef $read;
for (@articles) { vec($read,$_,1) = 1 }

Now check whether vec($read,$n,1) is true for some $n.

These methods guarantee fast individual tests but require a re-organization of the original list or array. They only pay off if you have to test multiple values against the same array.

If you are testing only once, the standard module List::Util exports the function first for this purpose. It works by stopping once it finds the element. It's written in C for speed, and its Perl equivalent looks like this subroutine:

sub first (&@) {
    my $code = shift;
    foreach (@_) {
        return $_ if &{$code}();
    }
    undef;
}

If speed is of little concern, the common idiom uses grep in scalar context (which returns the number of items that passed its condition) to traverse the entire list. This does have the benefit of telling you how many matches it found, though.

my $is_there = grep $_ eq $whatever, @array;

If you want to actually extract the matching elements, simply use grep in list context.

my @matches = grep $_ eq $whatever, @array;
brian d foy
the only issue that I see, is that the order is important ... effectively I am comparing commands sent to a system, these commands are represented as hex sequences, comparing two outputs of two control systems requires not only knowledge of what was sent, but also knowledge of in what order which is why I chose arrays ... perhaps though, setting up a hash such that the mapped value equals the position sent $values{$_} = $n instead of just 1 would provide flexibility to use some of the search tools mentioned.
Reed Debaets
You can maintain order. In my specific answer I showed you how to work with the indices. Once you know the invalid indices, etc, it's your job to figure out what to do with them. The FAQ answer just shows you what's possible. You have to adapt it to your needs.
brian d foy
A: 

From perlfaq4's answer to How do I compute the difference of two arrays? How do I compute the intersection of two arrays?:


Use a hash. Here's code to do both and more. It assumes that each element is unique in a given array:

@union = @intersection = @difference = ();
%count = ();
foreach $element (@array1, @array2) { $count{$element}++ }
foreach $element (keys %count) {
    push @union, $element;
    push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
    }

Note that this is the symmetric difference, that is, all elements in either A or in B but not in both. Think of it as an xor operation.

brian d foy
we are communicating in near real time here and I really appreciate your willingness to offer as much advice as you have! I still do not believe, however, that these solutions take the order of the lists into account... correct me if I'm wrong.
Reed Debaets
The FAQ just show you what you can do in Perl. You can adapt these to work with the specific order. My specific example shows you how to work with the various indices.
brian d foy
+4  A: 

If the arrays contain duplicate values, the answer is quite a bit more complicated than that.

See e.g. Algorithm::Diff or read about Levenshtein distance.

reinierpost
thanks for the links! Yes the assumption is that there will be duplicate values...
Reed Debaets
I don't think you need to resort to Algorithm::Diff for this. Perhaps you could explain why you think Levenshtein distance is relevant. I'm not seeing it.
brian d foy
I realize my scenarios for output are "MISSING" or deletion, "EXTRA" or insertion, "INVALID" or substitution, and "NO_FAIL" or pass ... these correspond directly to diffs operations ... levenshtein's algorithm accurately gets the least number of modifications to a list of values to make them match ... by identifying what that modification would have to be, I'm able to identify what is missing, invalid, and extra. For now I'm actually passing my lists to sdiff now with a simple pipe file descriptor since I can't install extra modules on my target machines.
Reed Debaets
I agree that the best solution can only be found with a more exact specification. Levenshtein is relevant when you want to find the optimal synchronization of the two lists, you have duplicates, and transposition of two elements isn't possible. (If transposition is as expensive as addition or deletion, we get Damerau-Levenshtein; if it's free, count the numbers of elements at both sides in hashes and take the sum of the differences.) This is quadratic in the sizes of the sequences (I've seen a webpage claim otherwise, but without proof.) diff is faster but not always optimal.
reinierpost
Okay, so now you have a new wrinkle to also change the second list?
brian d foy
@bria: how do you mean a new wrinkle? the deletions and additions are both present in the example.
reinierpost
The new wrinkle is the ordering or changing the list to be ordered.
brian d foy
But lists are always ordered.
reinierpost
Lists are ordered so you know how to go from one element to another. I should have said "sorted".
brian d foy
I'm not sure @Reed Debaets wanted the list to be sorted. A whole field of related problems hides behind this question.
reinierpost
A: 
sub compare {
    local @d = ();

    my $this = shift;
    my $that = shift;
    my $distance = _levenshteindistance($this, $that);

    my @thisorig = @{ $this };
    my @thatorig = @{ $that };

    my $s = $#thisorig;
    my $t = $#thatorig;

    @this = ();
    @that = ();
    @fail = ();

    while($s>0 || $t>0) {
        #                  deletion,    insertion,   substitution
        my $min = _minimum($d[$s-1][$t],$d[$s][$t-1],$d[$s-1][$t-1]);
        if($min == $d[$s-1][$t-1]) {
            unshift(@this,$thisorig[$s]);
            unshift(@that,$thatorig[$t]);
            if($d[$s][$t] > $d[$s-1][$t-1]) {
                unshift(@fail,'INVALID');
            } else {
                unshift(@fail,'NO_FAIL');
            }
            $s -= 1;
            $t -= 1;
        } elsif($min == $d[$s][$t-1]) {
            unshift(@this,'*');
            unshift(@that,$thatorig[$t]);
            unshift(@fail,'EXTRA');
            $t -= 1;
        } elsif($min == $d[$s-1][$t]) {
            unshift(@this,$thisorig[$s]);
            unshift(@that,'*');
            unshift(@fail,'MISSING');
            $s -= 1;
        } else {
            die("Error! $!");
        }
    }

    return(\@this, \@that, \@fail);

}

sub _minimum {
    my $ret = 2**53;
    foreach $in (@_) {
        $ret = $ret < $in ? $ret : $in;
    }
    $ret;
}

sub _levenshteindistance {
    my $s = shift;
    my $t = shift;
    my @s = @{ $s };
    my @t = @{ $t };

    for(my $i=0;$i<scalar(@s);$i++) {
        $d[$i] = ();
    }

    for(my $i=0;$i<scalar(@s);$i++) {
        $d[$i][0] = $i # deletion
    }
    for(my $j=0;$j<scalar(@t);$j++) {
        $d[0][$j] = $j # insertion
    }

    for(my $j=1;$j<scalar(@t);$j++) {
        for(my $i=1;$i<scalar(@s);$i++) {
            if ($s[$i] eq $t[$j]) {
                $d[$i][$j] = $d[$i-1][$j-1];
            } else {
                #                    deletion,      insertion,     substitution
                $d[$i][$j] = _minimum($d[$i-1][$j]+1,$d[$i][$j-1]+1,$d[$i-1][$j-1]+1);
            }
        }
    }

    foreach $a (@d) {
        @a = @{ $a };
        foreach $b (@a) {
            printf STDERR "%2d ",$b;
        }
        print STDERR "\n";
    }

    return $d[$#s][$#t];
}
Reed Debaets