I have a list of words, some of which are compound words. For example:

  • palanca
  • plato
  • platopalanca

I need to remove "plato" and "palanca" and leave only "platopalanca". I used array_unique to remove duplicates, but these compound words are trickier.

Should I sort the list by word length and compare the words one by one? Or is a regular expression the answer?

update: The list of words is much bigger and mixed, not only related words

update 2: I can safely implode the array into a string.

update 3: I'm trying to avoid doing this as if it were a bubble sort. There must be a more efficient way of doing this.

Well, I think that a bubble-sort-like approach is the only possible one :-( I don't like it, but it's what I have... Any better approach?

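// Sort shortest first, so each word is only compared against words at least as long
function sortByLengthAsc($a, $b) {
    return strlen($a) - strlen($b);
}

usort($words, 'sortByLengthAsc');
$count = count($words);
$delete = array();
for ($i = 0; $i < $count; $i++) {
    for ($j = $i + 1; $j < $count; $j++) {
        // strpos() !== false is a safe "contains" test; strstr() can return a falsy string
        if (strpos($words[$j], $words[$i]) !== false) {
            $delete[] = $i;
            break; // one match is enough to mark this word
        }
    }
}
foreach ($delete as $i) {
    unset($words[$i]);
}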

update 5: Sorry all. I'm a moron. Jonathan Swift made me realize I was asking the wrong question. Given x words which START the same, I need to remove the shortest ones.

  • "hot, dog, stand, hotdogstand" should become "dog, stand, hotdogstand"
  • "car, pet, carpet" should become "pet, carpet"
  • "palanca, plato, platopalanca" should become "palanca, platopalanca"
  • "platoother, other" should be left untouched, since they start differently
A: 

You can take each word and check whether any other word in the array starts with it or ends with it. If so, that word should be removed (unset()).
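
A minimal sketch of that check, assuming PHP 7 or later. The function name removeAffixWords is made up for illustration, and the sketch follows the original wording of the question (both "plato" and "palanca" go), not the prefix-only rule from update 5:

```php
<?php
// Hypothetical helper: removes every word that a longer word starts or ends with.
function removeAffixWords(array $words): array
{
    $result = array();
    foreach ($words as $i => $candidate) {
        $len = strlen($candidate);
        $isAffix = false;
        foreach ($words as $j => $other) {
            // Only a strictly longer word can start or end with $candidate.
            if ($i === $j || strlen($other) <= $len) {
                continue;
            }
            if (substr($other, 0, $len) === $candidate
                || substr($other, -$len) === $candidate) {
                $isAffix = true;
                break;
            }
        }
        if (!$isAffix) {
            $result[] = $candidate;
        }
    }
    return $result;
}

print_r(removeAffixWords(array('palanca', 'plato', 'platopalanca')));
```

This compares every word with every other word, so it is quadratic in the size of the list. Dropping the second substr() test would restrict it to words that start the same.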

FractalizeR
A: 

Regex could work. Within the regex you can anchor the match to the start and the end of the string.

^ marks the start of the string and $ marks the end

so something like

foreach($array as $value)
{
    //$term is the value that you want to remove; preg_quote() escapes any
    //regex metacharacters it may contain
    if(preg_match('/^' . preg_quote($term, '/') . '$/', $value))
    {
        //Here you can be confident that $term is $value, and then either remove it from
        //$array, or you can add all not-matched values to a new result array
    }
}

would avoid your issue

But if you are just checking that two values are equal, == will work just as well as preg_match, and will possibly be faster.

In the event that the list of $terms and $values are huge this won't come out to be the most efficient of strategies, but it is a simple solution.
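
If the goal is to find values that merely start with a term, rather than equal it, the start anchor alone is enough. Here is a rough sketch, assuming PHP 7 or later; startsWithTerm is a hypothetical name, not something from the question:

```php
<?php
// Hypothetical helper: true when $value starts with $term and is longer than it.
function startsWithTerm(string $term, string $value): bool
{
    // preg_quote() escapes regex metacharacters in $term. The trailing '.'
    // (with the s modifier) demands at least one more character after the prefix.
    $pattern = '/^' . preg_quote($term, '/') . './s';
    return preg_match($pattern, $value) === 1;
}

var_dump(startsWithTerm('car', 'carpet'));
var_dump(startsWithTerm('pet', 'carpet'));
```

A plain strncmp() or strpos() === 0 check would do the same job without the regex engine.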

If performance is an issue, sorting the lists (note the provided sort function) and then iterating down them side by side might be more useful. I'm going to test that idea before I post the code here.

A: 

You could put the words into an array, sort the array alphabetically and then loop through it checking if the next words start with the current index, thus being composed words. If they do, you can remove the word in the current index and the latter parts of the next words...

Something like this:

$array = array('palanca', 'plato', 'platopalanca');
// ok, the example array is already sorted alphabetically, but anyway...
sort($array);

// another array for words to be removed
$removearray = array();

// loop through the array, the last index won't have to be checked
for ($i = 0; $i < count($array) - 1; $i++) {

  $current = $array[$i];

  // use another loop in case there are more than one combined words
  // if the words are case sensitive, use strpos() instead to compare
  while ($i + 1 < count($array) && stripos($array[$i + 1], $current) === 0) {
    $next = $array[$i + 1];
    // the next word starts with the current one, so remove current
    $removearray[] = $current;
    // get the other word to remove
    $removearray[] = substr($next, strlen($current));
    $i++;
  }

}

// now just get rid of the words to be removed
// array_diff() keeps only the words that are not in $removearray
$result = array_diff($array, $removearray);
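
For the clarified requirement in update 5 (drop only the words that another word starts with), the same sort-then-scan idea might look like the sketch below. It assumes PHP 7 or later and case-sensitive comparison, and removePrefixWords is a made-up name. It relies on one property of byte-wise sorting: if a word is a prefix of any other word, it is also a prefix of its immediate neighbour in the sorted list, so a single pass is enough.

```php
<?php
// Hypothetical helper: drops every word that is a proper prefix of another word.
function removePrefixWords(array $words): array
{
    // Deduplicate first so equal words are not treated as prefixes of each other.
    $sorted = array_values(array_unique($words));
    // SORT_STRING forces byte-wise string comparison, so a prefix always
    // sorts directly before the words that extend it.
    sort($sorted, SORT_STRING);

    $result = array();
    $count = count($sorted);
    for ($i = 0; $i < $count; $i++) {
        // Only the next word has to be checked, thanks to the sort order.
        $isPrefix = $i + 1 < $count
            && strncmp($sorted[$i + 1], $sorted[$i], strlen($sorted[$i])) === 0;
        if (!$isPrefix) {
            $result[] = $sorted[$i];
        }
    }
    return $result;
}

print_r(removePrefixWords(array('hot', 'dog', 'stand', 'hotdogstand')));
```

The sort dominates the cost, so this avoids the pairwise comparison of the bubble-sort-like version. Note that the result comes back in sorted order, not in the original order.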
kkyy
Why the downvote?
kkyy
+2  A: 

I think you need to define the problem a little more, so that we can give a solid answer. Here are some pathological lists. Which items should get removed?

  • hot, dog, hotdogstand.
  • hot, dog, stand, hotdogstand
  • hot, dogs, stand, hotdogstand

SOME CODE

This code should be more efficient than the one you have:

$words = array('hatstand','hat','stand','hot','dog','cat','hotdogstand','catbasket');

$count = count($words);

for ($i=0; $i<$count; $i++) {
 if (isset($words[$i])) {
  $len_i = strlen($words[$i]);
  for ($j=$i+1; $j<$count; $j++) {
   if (isset($words[$j])) {
    $len_j = strlen($words[$j]);

    if ($len_i<=$len_j) {
     // word j starts with word i, so word i goes
     if (substr($words[$j],0,$len_i)===$words[$i]) {
      unset($words[$i]);
      break; // word i is gone, stop comparing against it
     }
    } else {
     // word i starts with word j, so word j goes
     if (substr($words[$i],0,$len_j)===$words[$j]) {
      unset($words[$j]);
     }
    }
   }
  }
 }
}

foreach ($words as $word) {
 echo "$word<br>";
}

You could optimise this by storing word lengths in an array before the loops.
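
A rough sketch of that optimisation, assuming the same example list. The $lengths and $removed arrays are names introduced here for illustration. Unlike the loop above, this variant uses a strict length comparison, so exact duplicates are left in place:

```php
<?php
$words = array('hatstand', 'hat', 'stand', 'hot', 'dog', 'cat', 'hotdogstand', 'catbasket');

// Compute every length once instead of calling strlen() inside the loops.
$lengths = array_map('strlen', $words);
$count = count($words);
$removed = array(); // indexes of words that another, longer word starts with

for ($i = 0; $i < $count; $i++) {
    for ($j = 0; $j < $count; $j++) {
        if ($i !== $j
            && $lengths[$i] < $lengths[$j]
            && strncmp($words[$j], $words[$i], $lengths[$i]) === 0) {
            $removed[$i] = true;
            break; // one longer match is enough
        }
    }
}

// Drop the marked indexes and reindex the remaining words.
$result = array_values(array_diff_key($words, $removed));
print_r($result);
```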

Jonathan Swift
I already took care of plural forms. I'm updating my question. You made me realize I was taking the wrong approach. +1.
The Disintegrator