views:

882

answers:

6

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.

Example:

  • file:///home/gms8994/Music/t.A.T.u./
  • file:///home/gms8994/Music/nina%20sky/
  • file:///home/gms8994/Music/A%20Perfect%20Circle/

Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's similar between all 3 strings.

Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.

From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.

+3  A: 

It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.

Brett Daniel
Question is not about substring but prefix. Substring finding algorithms are more complex and inefficient for requested problem.
Hynek -Pichi- Vychodil
+2  A: 

If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings. Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.

Mathematica happens to have a function for this built in: http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)

If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).

dreeves
Question is not about substring but prefix. Substring finding algorithms are more complex and inefficient for requested problem.
Hynek -Pichi- Vychodil
True, I just thought it's good to give the most general answer for those searching for this later. I'll edit my answer to point out that longest common prefix is much faster if that's all you need.
dreeves
+2  A: 

My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.

In Perl, you'll have to split up the string first into characters using something like

@array = split(//, $string);

(splitting on an empty character sets each character into its own element of the array)

Then do a loop, perhaps overall:

$n =0;
@array1 = split(//, $string1);
@array2 = split(//, $string2);
@array3 = split(//, $string3);

while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
 $n++; 
}

$sameString = substr($string1, 0, $n); #n might have to be n-1

Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.

Perchik
+4  A: 

The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.

Here's the one I use for longest common prefix (and a ref to original URL):

use strict; use warnings;
sub longest_common_prefix {
    # longest_common_prefix( $|@ ): returns $
    # URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
    # find longest common prefix of scalar list
    my $prefix = shift;
    for (@_) {
     chop $prefix while (! /^\Q$prefix\E/);
        }
    return $prefix;
}

my @str = map {chomp; $_} <DATA>;
print longest_common_prefix(@ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).

#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';

use strict; use warnings;

sub LCS{
 my @str = @_;
 my @pos;
 for my $i (0 .. $#str) {
  my $line = $str[$i];
  for (0 .. length($line) - 1) {
   my $char= substr($line, $_, 1);
   push @{$pos[$i]{$char}}, $_;
  }
 }
 my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str;
 my %map;
 CHAR:
 for my $char (split //, $sh_str) {
  my @loop;
  for (0 .. $#pos) {
   next CHAR if ! $pos[$_]{$char};
   push @loop, $pos[$_]{$char};
  }
  my $next = NestedLoops([@loop]);
  while (my @char_map = $next->()) {
   my $key = join '-', @char_map;
   $map{$key} = $char;
  }
 }
 my @pile;
 for my $seq (keys %map) {
  push @pile, $map{$seq};
  for (1 .. 2) {
   my $dir = $_ % 2 ? 1 : -1;
   my @offset = split /-/, $seq;
   $_ += $dir for @offset;
   my $next = join '-', @offset;
   while (exists $map{$next}) {
    $pile[-1] = $dir > 0 ?
     $pile[-1] . $map{$next} : $map{$next} . $pile[-1];
    $_ += $dir for @offset;
    $next = join '-', @offset;
   }
  }
 }
 return reduce {length($a) > length($b) ? $a : $b} @pile;
}

my @str = map {chomp; $_} <DATA>;
print LCS(@str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Roy
+4  A: 

Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and @str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.

Perl can be fast:

use strict;
use warnings;

package LCP;

sub LCP {
    return '' unless @_;
    return $_[0] if @_ == 1;
    my $i          = 0;
    my $first      = shift;
    my $min_length = length($first);
    foreach (@_) {
     $min_length = length($_) if length($_) < $min_length;
    }
INDEX: foreach my $ch ( split //, $first ) {
     last INDEX unless $i < $min_length;
     foreach my $string (@_) {
      last INDEX if substr($string, $i, 1) ne $ch;
     }
    }
    continue { $i++ }
    return substr $first, 0, $i;
}

# Roy's implementation
sub LCP2 {
    return '' unless @_;
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
        }
    return $prefix;
}

1;

Test suite:

#!/usr/bin/env perl

use strict;
use warnings;

Test::LCP->runtests;

package Test::LCP;

use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);

sub test_use : Test(startup => 1) {
    use_ok('LCP');
}

sub test_lcp : Test(6) {
    is( LCP::LCP(),      '',    'Without parameters' );
    is( LCP::LCP('abc'), 'abc', 'One parameter' );
    is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
    is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
     'abcd', 'Some common prefix' );
    my @str = map { chomp; $_ } <DATA>;
    is( LCP::LCP(@str),
     'file:///home/gms8994/Music/', 'Test data prefix' );
    is( LCP::LCP2(@str),
     'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
    my $t = countit( 1, sub{LCP::LCP(@str)} );
    diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
    $t = countit( 1, sub{LCP::LCP2(@str)} );
    diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}

__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

Test suite result:

1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr +  0.00 sys =  1.09 CPU) @ 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr +  0.00 sys =  1.07 CPU) @ 16746.73/s (n=17919)

That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.

Hynek -Pichi- Vychodil
+1 for adding a test suite including cases such as an empty string; -1 for caring about performance tuning an algorithm implemented in a scripting language. Net score: 0.
j_random_hacker
And -999 for mystification ;-(
Hynek -Pichi- Vychodil
A fine analysis, but I would be wary of encouraging premature optimization. For new coders especially, code clarity is far more important. And the implied small scale use in this question likely receives no benefit from any optimization.
Roy
+1  A: 

From http://forums.macosxhints.com/showthread.php?t=33780

my @strings =
    (
      'file:///home/gms8994/Music/t.A.T.u./',
      'file:///home/gms8994/Music/nina%20sky/',
      'file:///home/gms8994/Music/A%20Perfect%20Circle/',
    );

my $common_part = undef;
my $sep = chr(0);  # assuming it's not used legitimately
foreach my $str ( @strings ) {

    # First time through loop -- set common
    # to whole
    if ( !defined $common_part ) {
        $common_part = $str;
        next;
    }

    if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
    {
        $common_part = $1;
    }
}

print "Common part = $common_part\n";
Hissohathair
It may not matter for the string lengths you're working with, but for longer strings this will be very slow. Even if Perl can optimise the final ".*$" out of your regex, each loop iteration will take O(n^2) time in the length of $str to find the right way to match the initial ".*.*".
j_random_hacker
Using .*$ looks useless for me. This solution works fine and is almost same fast as mine.
Hynek -Pichi- Vychodil