views:

1605

answers:

9

I need to check a string to see if any word in it has multiple occurences. So basically I will accept:

"google makes love"

but I don't accept:

"google makes google love" or "google makes love love google" etc.

Any ideas? Really don't know any way to approach this, any help would be greatly appreciated.

+1  A: 
function Accept($str)
{
 $words = explode(" ", trim($str));
 $len = count($words);
 for ($i = 0; $i < $len; $i++)
 {
  for ($p = 0; $p < $len; $p++)
  {
   if ($p != $i && $words[$i] == $words[$p])
   {
    return false;
   }
  }
 }
 return true;
}

EDIT

Entire test script. Note, when printing "false" php just prints nothing but true is printed as "1".

<?php

    function Accept($str)
    {
            $words = explode(" ", trim($str));
            $len = count($words);
            for ($i = 0; $i < $len; $i++)
            {
                    for ($p = 0; $p < $len; $p++)
                    {
                            if ($p != $i && $words[$i] == $words[$p])
                            {
                                    return false;
                            }
                    }
            }
            return true;
    }

echo Accept("google makes love"), ", ", Accept("google makes google love"), ", ",
 Accept("google makes love love google"), ", ", Accept("babe health insurance babe");


?>

Prints the correct output:

1, , ,
nlaq
thanks but I can't get it to work :)
zuk1
Now it works. I had a typo. $words[$i] == $p should of been $words[$i] == $words[$p]... I just tested it myself :)
nlaq
code won't work for strings like"babe health insurance babe"
zuk1
..Erm, it seems to on my end... *posting my entire script*
nlaq
strange works for all except babe health insurance
zuk1
Dunno what your apache/PHP install has been smoking but that string works fine as well on my end...
nlaq
A: 

The simplest method is to loop through each word and check against all previous words for duplicates.

lc
No need to store two arrays. Look at my example :)
nlaq
It's not quite the simplest method; take a look at my answer.
The Wicked Flea
Good point. I'm way too tired to be answering questions at this hour. Of course you easily have the subarray of all previous words. Just look up to the current index - 1. Duh. I removed the store two arrays part.
lc
A: 

Regular expression with backreferencing

http://www.regular-expressions.info/php.html

http://www.regular-expressions.info/named.html

cmsjr
Now there would be two problems. Regex isn't necessary, that's like fixing a car's leaking gas-tank with a blowtorch (when full).
The Wicked Flea
+3  A: 

Try this:

function single_use_of_words($str) {
  $words = explode(' ', $str);
  $words = array_unique($words);
  return implode(' ', $words);
}
The Wicked Flea
A bit slower then my method. You would have to compare the count of both arrays to get the result that the OP is looking for. However, I have never heard of the array_unique function before so thanks for that :)
nlaq
I hadn't actually checked the speed, but it produces a result simply. ;)
The Wicked Flea
+2  A: 
<?php
$words = preg_split('\b', $string, PREG_SPLIT_NO_EMPTY);
$wordsUnique = array_unique($words);
if (count($words) != count($wordsUnique)) {
    echo 'Duplicate word found!';
}
?>
Wesley Mason
The use of the regex to split the words will slow this down yet further.
The Wicked Flea
Yeah, and I get the feeling reading some of the OP's posts that he intends to use this for searching the entire source of multiple HTML files during the same script. I think it would be important to look at performance in this application.
nlaq
I used the regex to split on word boundaries as opposed to just splitting on spaces, should have made this clear.
Wesley Mason
Yes, the preg_split will avoid $words containing null words if two spaces appear in the string.
jmucchiello
+5  A: 

Based on Wicked Flea code:

function single_use_of_words($str) {  
   $words = explode(' ', trim($str));  //Trim to prevent any extra blank
   if (count(array_unique($words)) == count($words)) {
      return true; //Same amount of words
   }   
   return false;
}
Veynom
Rather than trim/explode, use str_word_count($str, 2). It's a bit more liberal in terms of whitespace/trimming.
Ciaran McNulty
Whilst this will work, it might be inefficient for large strings; even if the first and second words are identical, array_unique will still visit every element in the array. You could loop through the elements, adding each to something like a binary search tree, terminating on the FIRST duplicate.
Bobby Jack
It also fails if there are two occurances of 2 or more spaces. There will be multiple nulls it he $words array that will get counted as only one word in the array_unique call.
jmucchiello
+1  A: 

This seems fairly fast. It would be interesting to see (for all the answers) how the memory usage and time taken increase as you increase the length of the input string.

function check($str) {
    //remove double spaces
    $c = 1;
    while ($c) $str = str_replace('  ', ' ', $str, $c);

    //split into array of words
    $words = explode(' ', $str);
    foreach ($words as $key => $word) {
        //remove current word from array
        unset($words[$key]);
        //if it still exists in the array it must be duplicated
        if (in_array($word, $words)) {
            return false;
        }
    }
    return true;
}

Edit

Fixed issue with multiple spaces. I'm not sure whether it is better to remove these at the start (as I have) or check each word is non-empty in the foreach.

Tom Haigh
I would be interested as well.
nlaq
Memory usage shouldn't be too bad, but it has O(n^2) time because in_array is a linear search. This also suffers the double space problem.
jmucchiello
+1  A: 

No need for loops or arrays:

<?php

$needle = 'cat';
$haystack = 'cat in the cat hat';

if ( occursMoreThanOnce($haystack, $needle) ) {
    echo 'Success'; 
} 

function occursMoreThanOnce($haystack, $needle) {
    return strpos($haystack, $needle) !== strrpos($haystack, $needle);
}

?>
Kevin
+1  A: 

The regular expression way would definitely be my choice.

I did a little test on a string of 320 words with Veynom's function and a regular expression

function preg( $txt ) {
    return !preg_match( '/\b(\w+)\b.*?\1/', $txt );
}

Here's the test

$time['preg'] = microtime( true );

for( $i = 0; $i < 1000; $i++ ) {
    preg( $txt );
}

$time['preg'] = microtime( true ) - $time['preg'];


$time['veynom-thewickedflea'] = microtime( true );

for( $i = 0; $i < 1000; $i++ ) {
    single_use_of_words( $txt );
}

$time['veynom-thewickedflea'] = microtime( true ) - $time['veynom-thewickedflea'];

print_r( $time );

And here's the result I got

Array
(
    [preg] => 0.197616815567
    [veynom-thewickedflea] => 0.487532138824
)

Which suggests that the RegExp solution, as well as being a lot more concise is more than twice as fast. ( for a string of 320 words anr 1000 iterations )

When I run the test over 10 000 iterations I get

Array
(
    [preg] => 1.51235699654
    [veynom-thewickedflea] => 4.99487900734
)

The non RegExp solution also uses a lot more memory.

So.. Regular Expressions for me cos they've got a full tank of gas

EDIT
The text I tested against has duplicate words, If it doesn't, the results may be different. I'll post another set of results.

Update
With the duplicates stripped out ( now 186 words ) the results for 1000 iterations is:

Array
(
    [preg] => 0.235826015472
    [veynom-thewickedflea] => 0.2528860569
)

About evens

meouw
The only problem here is the array_unique method will call "hello" and "hello?" two different word while preg_split using \w+ will consider those to both equal "hello". To duplicate the array_unique method you need ([-\b]+) in place of (\w+). That could change your speed.
jmucchiello
Is that not another argument for the regexp aproach? hello and hello? are the same word surely.
meouw