$string = 'I like banana, banana souffle, chocobanana and marshmellows.';
$arr = some_function($string); 
// $arr = ('banana'=>3,'I'=>1,'like'=>1....);

Do you have an idea how to do this most efficiently?

+2  A: 

You can use array_count_values.

For example:

$string = 'I like banana, banana souffle, chocobanana and marshmellows';
$s = preg_split("/[, ]+/",$string);
print_r(array_count_values($s));

Note: this only counts whole words, i.e. "banana" will be 2, not 3, because "chocobanana" is not the same as "banana". If you want to count words that occur inside other words, extra coding is necessary.
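
If only a single, known word has to be counted inside other words, one possible sketch uses PHP's built-in substr_count alongside the whole-word count above. The variable names $wholeWords and $insideWords are illustrative, and the matching is case-sensitive:

```php
<?php
$string = 'I like banana, banana souffle, chocobanana and marshmellows';

// Whole-word counts: split on commas/spaces, then tally each token.
$wholeWords = array_count_values(preg_split('/[, ]+/', $string));

// Substring count: also finds "banana" inside "chocobanana".
$insideWords = substr_count($string, 'banana');

echo $wholeWords['banana'], "\n"; // 2
echo $insideWords, "\n";          // 3
```

This does not solve the general problem of deciding which sub-words are meaningful; it only works when you already know the word you are looking for.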

ghostdog74
So if I do want to count "chocobanana", I basically need to use explode and loop through the string with preg_match_all? Is there no other way around that?
sombe
It's quite complicated. You need to know how to break up "choco" and "banana". For example, you have "I", but if you also have, say, "Immediate", then how are you going to count "I"? You need a way to separate out those words that have meaning ....
ghostdog74
+4  A: 
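$str = 'I like banana, banana souffle, chocobanana and marshmellows.';
// Extract the list of words (format 1 returns them as an array).
$words = str_word_count($str, 1);
$freq = array();
foreach ($words as $w) {
  // Count every case-sensitive occurrence of the word in the whole string,
  // including occurrences inside longer words such as "chocobanana".
  if (preg_match_all('/' . preg_quote($w, '/') . '/', $str, $m)) {
    $freq[$w] = count($m[0]);
  }
}
print_r($freq);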
jspcal
Ha! I knew str_word_count was useful for this :)
Gordon
I've checked it out; it works well so far. Do you have an idea how to sort the array by occurrence?
sombe
@Gal `asort` sorts an array and maintains index association
Gordon
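
As a small sketch of the sorting suggestion: asort orders the values from lowest to highest, so for "most frequent first" the descending counterpart arsort is probably what is wanted. The $freq data below is illustrative, not the full output of the answer's code:

```php
<?php
// Illustrative frequency table (word => count).
$freq = array('I' => 1, 'like' => 1, 'banana' => 3, 'souffle' => 1);

// Sort by count, highest first, keeping the word => count association.
arsort($freq);

print_r($freq); // 'banana' => 3 comes first
```

The relative order of words with equal counts should not be relied on.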
So I take it that what you want is: if you have, say, "altercation alternation", then the word "alter" doesn't count, right? "alter" is a sub-word within these two words, "altercation" and "alternation".
ghostdog74
A: 
preg_match_all('!\b\w+\b!', $string, $matches);
$arr = array_count_values($matches[0]);
print_r($arr);
cletus
A: 

Because you want to count partial words, you will need a wordlist of possible words. First split the text into words on whitespace, then loop through all the words and try to find the longest possible substring match against the wordlist. This will of course be really, really slow if the wordlist is big, but you may be able to speed up the matching by using a suffix array of the word you are searching through.

If you don't find a matching substring, just count the whole word as one.

I hope you understand my idea. It's not that great, but it's the solution I can come up with for your requirements.
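
A minimal sketch of the idea, assuming a plain linear scan over the wordlist rather than a suffix array. The function name count_with_wordlist and the one-entry wordlist are hypothetical, chosen only for illustration:

```php
<?php
// Hypothetical helper: counts each word of the text under the longest
// wordlist entry it contains, or under itself when no entry matches.
function count_with_wordlist($text, array $wordlist)
{
    // Try longer entries first so the first hit is the longest match.
    usort($wordlist, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    $freq = array();
    foreach (str_word_count($text, 1) as $word) {
        $key = $word; // fallback: count the whole word as one
        foreach ($wordlist as $candidate) {
            if ($candidate !== '' && strpos($word, $candidate) !== false) {
                $key = $candidate;
                break;
            }
        }
        $freq[$key] = isset($freq[$key]) ? $freq[$key] + 1 : 1;
    }
    return $freq;
}

$string = 'I like banana, banana souffle, chocobanana and marshmellows.';
$arr = count_with_wordlist($string, array('banana')); // assumed wordlist
print_r($arr); // 'banana' => 3, every other word => 1
```

Each word is checked against every wordlist entry, so the cost grows with the size of the wordlist, which is the slowness mentioned above.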

Emil Vikström