views:

122

answers:

5

I have a long list of words in an array. Some short, some long. I'd like to filter out those words which starts with a word from the array (length of this "prefix" word can be set to, say, 3 characters) and which at the same time ends with a word from it.

Let's say the first word is 'carport'. Now, if 'car' and 'port' exist in the array too I would get a match. But if the word is 'carlsberg' I wouldn't get a match (since 'lsberg' probably wouldn't be an existing word in the array).

Results would preferably come out as "prefix word, suffix word, whole word".

I'd consider using any language that can make me do this, although I'm mostly a JavaScript guy myself.

A: 

Well, the naive implementation in JavaScript would be like this:

function triples(words) { 
    var result = new Array();
    for(var i=0; i<words.length; i++) {
        for(var j=0; j<words.length; j++) {
            var k = words.indexOf(words[i] + words[j]);
            if(k != -1) {
                result.push([words[i], words[j], words[k]]);
            }
        }
    } 
    return result;
}

The function in its current form requires the array of all words as parameter and returns an array of arrays containing the found word triples (first element being the prefix, second element being the postfix, third element being the combined word).

jfs
A: 

Something like this:

#!/usr/bin/perl

use strict;
use warnings;

my @candidates=qw( carport Carsburg butterfly 
                buttercup Christmas wishlist carpface flyface buttface);
my @arr=<DATA>;
chomp @arr;

for my $i (3..6) {
    foreach my $j (@candidates) {
        my ($fp,$lp)=($1,$2) if ($j=~/(^.{$i})(.*$)/);
        if($fp && $lp) {
            my @hit1=grep(/^$fp/,@arr);
            my @hit2=grep(/$lp$/,@arr);
            print "candidate: $j\n start= @hit1 end= @hit2\n=====\n" 
                if (scalar @hit1 && scalar @hit2);
        }
    }
}

__DATA__
car
port
wish
list
Christ
mas
butter
cup
fly
face
butt

Output:

candidate: carport
 start= car end= port
=====
candidate: flyface
 start= fly end= face
=====
candidate: wishlist
 start= wish end= list
=====
candidate: buttface
 start= butter butt end= face
=====
candidate: butterfly
 start= butter end= fly
=====
candidate: buttercup
 start= butter end= cup
=====
candidate: Christmas
 start= Christ end= mas
drewk
Having "carport" (and other "combined" words) in that list gives me "end= carport port" though, but I think you're close to what I'm after. Maybe filtering out matches with multiple starts and ends? I'm thinking of using this filter on a large amount of text, maybe some kind of dictionary even, so I figure each start word must come up anyhow?
naton
I am not sure I understand. Are you saying you added "carport" to the list under `__DATA`? If you want this type of filtering based on a single list rather than two (the way I wrote it) it is slightly different logic.
drewk
+1  A: 

I wonder if a trie would help, see What is the most common use of the “trie” data structure?.

Perl has a couple modules to build them:

Something else that sounds kind of like it would be a starting place is Ruby's Abbrev module:

#!/usr/bin/env ruby

require 'abbrev'
require 'pp'

pp %w[car port carport carlsberg].abbrev
# >> {"por"=>"port",
# >>  "po"=>"port",
# >>  "p"=>"port",
# >>  "carpor"=>"carport",
# >>  "carpo"=>"carport",
# >>  "carp"=>"carport",
# >>  "carlsber"=>"carlsberg",
# >>  "carlsbe"=>"carlsberg",
# >>  "carlsb"=>"carlsberg",
# >>  "carls"=>"carlsberg",
# >>  "carl"=>"carlsberg",
# >>  "car"=>"car",
# >>  "port"=>"port",
# >>  "carport"=>"carport",
# >>  "carlsberg"=>"carlsberg"}
Greg
A: 

Here is a Perl solution that is O(n + 2m):

use warnings;
use strict;
use Data::Dumper;

my @words = qw(car carport carlsberg cartographer airport photographer);

my @ends  = qw(car port air grapher);

my $ends_re = join '|' => @ends;

my @matches = map {/^($ends_re).*($ends_re)$/ ? [$1, $_, $2] : ()} @words;

print Dumper \@matches;

prints:

$VAR1 = [
      [
        'car',
        'carport',
        'port'
      ],
      [
        'car',
        'cartographer',
        'grapher'
      ],
      [
        'air',
        'airport',
        'port'
      ]
    ];
Eric Strom
A: 

I would do something like:

<?php

    $words = array('experts', 'exchange', 'expert', 'sexchange');

    // build trie
    $t = array();
    foreach ($words as $word)
    {
        $n = &$t;
        for ($i = 0; $i < strlen($word); ++$i)
        {
            $c = $word[$i];

            if (!isset($n[$c])) $n[$c] = array();

            $n = &$n[$c];
        }

        $n['.'] = true;
    }

    $word = 'expertsexchange';

    $n = $t;
    for ($i = 0; $i < strlen($word); ++$i)
    {
        $c = $word[$i];

        if (isset($n['.']))
        {
            $o = $t;
            for ($j = $i; $j < strlen($word); ++$j)
            {
                $d = $word[$j];
                if (!isset($o[$d])) break;
                $o = $o[$d];                    
            }

            # found match
            if ($j == strlen($word) && isset($o['.']))
            {
                echo substr($word, 0, $i).",".substr($word,$i).",".$word."\n";
            }
        }

        if (isset($n[$c]))
        {
            $n = $n[$c];
        }
        else
            break;
    }
?>

Results:

expert,sexchange,expertsexchange
experts,exchange,expertsexchange

I wrote it on the spot, so it might not work exactly right. But the idea is to build a prefix tree and walk through it. Every time you find a prefix (indicated by coming to a '.'), continue again from the top of the tree to see if you can find a suffix from that point. This assumes you want nothing between the prefix and suffix.

konforce