ansaurus

Question

find extra, missing, invalid strings when comparing two lists in perl

Answer 1

A:

The trick in Perl (and similar languages) is the hash, which doesn't care about order.

Suppose that the first array is the one that hold the valid elements. Construct a hash with those values as keys:

  my @valid = qw( one two ... );
  my %valid = map { $_, 1 } @valid;

Now, to find the invalid elements, you just have to find the ones not in the the %valid hash:

  my @invalid = grep { ! exists $valid{$_} } @array;

If you want to know the array indices of the invalid elements:

  my @invalid_indices = grep { ! exists $valid{$_} } 0 .. $#array;

Now, you can expand that to find the repeated elements too. Not only do you check the %valid hash, but also keep track of what you have already seen:

 my %Seen;
 my @invalid_indices = grep { ! exists $valid{$_} && ! $Seen{$_}++ } 0 .. $#array;

The repeated valid elements are the ones with a value in %Seen that is greater than 1:

 my @repeated_valid = grep { $Seen{$_} > 1 } @valid;

To find the missing elements, you look in %Seen to check what isn't in there.

 my @missing = grep { ! $Seen{$_ } } @valid;

brian d foy 2010-01-11 19:12:58

What about elements in list 2 that don't exist in list 1? I suppose I could make a hash %this and %that and cross reference using grep ( i wasn't aware there was a grep in perl ... whoopsie) so I'm probably looking at *simpler* though perhaps not the actually functionality I need....

Reed Debaets 2010-01-11 19:29:33

Um, that's the @missing example. Maybe our edits crossed on the wire.

brian d foy 2010-01-11 19:31:15

Answer 2

A:

From perlfaq4's answer to How can I tell whether a certain element is contained in a list or array?:

(portions of this answer contributed by Anno Siegel and brian d foy)

Hearing the word "in" is an indication that you probably should have used a hash, not a list or array, to store your data. Hashes are designed to answer this question quickly and efficiently. Arrays aren't.

That being said, there are several ways to approach this. In Perl 5.10 and later, you can use the smart match operator to check that an item is contained in an array or a hash:

use 5.010;

if( $item ~~ @array )
    {
    say "The array contains $item"
    }

if( $item ~~ %hash )
    {
    say "The hash contains $item"
    }

With earlier versions of Perl, you have to do a bit more work. If you are going to make this query many times over arbitrary string values, the fastest way is probably to invert the original array and maintain a hash whose keys are the first array's values:

@blues = qw/azure cerulean teal turquoise lapis-lazuli/;
%is_blue = ();
for (@blues) { $is_blue{$_} = 1 }

Now you can check whether $is_blue{$some_color}. It might have been a good idea to keep the blues all in a hash in the first place.

If the values are all small integers, you could use a simple indexed array. This kind of an array will take up less space:

@primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);
@is_tiny_prime = ();
for (@primes) { $is_tiny_prime[$_] = 1 }
# or simply  @istiny_prime[@primes] = (1) x @primes;

Now you check whether $is_tiny_prime[$some_number].

If the values in question are integers instead of strings, you can save quite a lot of space by using bit strings instead:

@articles = ( 1..10, 150..2000, 2017 );
undef $read;
for (@articles) { vec($read,$_,1) = 1 }

Now check whether vec($read,$n,1) is true for some $n.

These methods guarantee fast individual tests but require a re-organization of the original list or array. They only pay off if you have to test multiple values against the same array.

If you are testing only once, the standard module List::Util exports the function first for this purpose. It works by stopping once it finds the element. It's written in C for speed, and its Perl equivalent looks like this subroutine:

sub first (&@) {
    my $code = shift;
    foreach (@_) {
        return $_ if &{$code}();
    }
    undef;
}

If speed is of little concern, the common idiom uses grep in scalar context (which returns the number of items that passed its condition) to traverse the entire list. This does have the benefit of telling you how many matches it found, though.

my $is_there = grep $_ eq $whatever, @array;

If you want to actually extract the matching elements, simply use grep in list context.

my @matches = grep $_ eq $whatever, @array;

brian d foy 2010-01-11 19:29:43

the only issue that I see, is that the order is important ... effectively I am comparing commands sent to a system, these commands are represented as hex sequences, comparing two outputs of two control systems requires not only knowledge of what was sent, but also knowledge of in what order which is why I chose arrays ... perhaps though, setting up a hash such that the mapped value equals the position sent $values{$_} = $n instead of just 1 would provide flexibility to use some of the search tools mentioned.

Reed Debaets 2010-01-11 19:36:38

You can maintain order. In my specific answer I showed you how to work with the indices. Once you know the invalid indices, etc, it's your job to figure out what to do with them. The FAQ answer just shows you what's possible. You have to adapt it to your needs.

brian d foy 2010-01-11 19:54:31

Answer 3

A:

From perlfaq4's answer to How do I compute the difference of two arrays? How do I compute the intersection of two arrays?:

Use a hash. Here's code to do both and more. It assumes that each element is unique in a given array:

@union = @intersection = @difference = ();
%count = ();
foreach $element (@array1, @array2) { $count{$element}++ }
foreach $element (keys %count) {
    push @union, $element;
    push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
    }

Note that this is the symmetric difference, that is, all elements in either A or in B but not in both. Think of it as an xor operation.

brian d foy 2010-01-11 19:32:28

we are communicating in near real time here and I really appreciate your willingness to offer as much advice as you have! I still do not believe, however, that these solutions take the order of the lists into account... correct me if I'm wrong.

Reed Debaets 2010-01-11 19:39:56

The FAQ just show you what you can do in Perl. You can adapt these to work with the specific order. My specific example shows you how to work with the various indices.

brian d foy 2010-01-11 19:56:15

Answer 4

+4 A:

If the arrays contain duplicate values, the answer is quite a bit more complicated than that.

See e.g. Algorithm::Diff or read about Levenshtein distance.

reinierpost 2010-01-11 19:41:07

thanks for the links! Yes the assumption is that there will be duplicate values...

Reed Debaets 2010-01-11 19:42:18

I don't think you need to resort to Algorithm::Diff for this. Perhaps you could explain why you think Levenshtein distance is relevant. I'm not seeing it.

brian d foy 2010-01-11 19:58:07

I realize my scenarios for output are "MISSING" or deletion, "EXTRA" or insertion, "INVALID" or substitution, and "NO_FAIL" or pass ... these correspond directly to diffs operations ... levenshtein's algorithm accurately gets the least number of modifications to a list of values to make them match ... by identifying what that modification would have to be, I'm able to identify what is missing, invalid, and extra. For now I'm actually passing my lists to sdiff now with a simple pipe file descriptor since I can't install extra modules on my target machines.

Reed Debaets 2010-01-11 20:25:17

I agree that the best solution can only be found with a more exact specification. Levenshtein is relevant when you want to find the optimal synchronization of the two lists, you have duplicates, and transposition of two elements isn't possible. (If transposition is as expensive as addition or deletion, we get Damerau-Levenshtein; if it's free, count the numbers of elements at both sides in hashes and take the sum of the differences.) This is quadratic in the sizes of the sequences (I've seen a webpage claim otherwise, but without proof.) diff is faster but not always optimal.

reinierpost 2010-01-12 09:44:56

Okay, so now you have a new wrinkle to also change the second list?

brian d foy 2010-01-12 11:20:03

@bria: how do you mean a new wrinkle? the deletions and additions are both present in the example.

reinierpost 2010-01-20 15:09:33

The new wrinkle is the ordering or changing the list to be ordered.

brian d foy 2010-01-20 15:37:25

But lists are always ordered.

reinierpost 2010-01-21 13:48:12

Lists are ordered so you know how to go from one element to another. I should have said "sorted".

brian d foy 2010-09-25 18:34:31

I'm not sure @Reed Debaets wanted the list to be sorted. A whole field of related problems hides behind this question.

reinierpost 2010-09-27 06:48:50

Answer 5

A:

sub compare {
    local @d = ();

    my $this = shift;
    my $that = shift;
    my $distance = _levenshteindistance($this, $that);

    my @thisorig = @{ $this };
    my @thatorig = @{ $that };

    my $s = $#thisorig;
    my $t = $#thatorig;

    @this = ();
    @that = ();
    @fail = ();

    while($s>0 || $t>0) {
        #                  deletion,    insertion,   substitution
        my $min = _minimum($d[$s-1][$t],$d[$s][$t-1],$d[$s-1][$t-1]);
        if($min == $d[$s-1][$t-1]) {
            unshift(@this,$thisorig[$s]);
            unshift(@that,$thatorig[$t]);
            if($d[$s][$t] > $d[$s-1][$t-1]) {
                unshift(@fail,'INVALID');
            } else {
                unshift(@fail,'NO_FAIL');
            }
            $s -= 1;
            $t -= 1;
        } elsif($min == $d[$s][$t-1]) {
            unshift(@this,'*');
            unshift(@that,$thatorig[$t]);
            unshift(@fail,'EXTRA');
            $t -= 1;
        } elsif($min == $d[$s-1][$t]) {
            unshift(@this,$thisorig[$s]);
            unshift(@that,'*');
            unshift(@fail,'MISSING');
            $s -= 1;
        } else {
            die("Error! $!");
        }
    }

    return(\@this, \@that, \@fail);

}

sub _minimum {
    my $ret = 2**53;
    foreach $in (@_) {
        $ret = $ret < $in ? $ret : $in;
    }
    $ret;
}

sub _levenshteindistance {
    my $s = shift;
    my $t = shift;
    my @s = @{ $s };
    my @t = @{ $t };

    for(my $i=0;$i<scalar(@s);$i++) {
        $d[$i] = ();
    }

    for(my $i=0;$i<scalar(@s);$i++) {
        $d[$i][0] = $i # deletion
    }
    for(my $j=0;$j<scalar(@t);$j++) {
        $d[0][$j] = $j # insertion
    }

    for(my $j=1;$j<scalar(@t);$j++) {
        for(my $i=1;$i<scalar(@s);$i++) {
            if ($s[$i] eq $t[$j]) {
                $d[$i][$j] = $d[$i-1][$j-1];
            } else {
                #                    deletion,      insertion,     substitution
                $d[$i][$j] = _minimum($d[$i-1][$j]+1,$d[$i][$j-1]+1,$d[$i-1][$j-1]+1);
            }
        }
    }

    foreach $a (@d) {
        @a = @{ $a };
        foreach $b (@a) {
            printf STDERR "%2d ",$b;
        }
        print STDERR "\n";
    }

    return $d[$#s][$#t];
}

Reed Debaets 2010-01-12 15:12:06

ansaurus

tags:

views:

answers:

find extra, missing, invalid strings when comparing two lists in perl

related questions