Take this string as an example: "will see you in London tomorrow and Kent the day after tomorrow".

How would I convert this to an associative array that has the keywords as keys and their counts as values, preferably leaving out the common words, like this:

Array ( [tomorrow] => 2 [London] => 1 [Kent] => 1)

Any help greatly appreciated.

+7  A: 

I would say you could:

  • split the string into an array of words
    • with explode
    • or preg_split
    • depending on how complex the word separators you need to handle are
  • use array_filter to keep only the items (i.e. words) you want
    • the callback function has to return false for every word that is not valid
  • and then use array_count_values on the resulting list of words
    • which counts how many times each word appears in the array



EDIT: and, just for fun, here's a quick example:

First of all, the string, which gets split into words:

$str = "will see you in London tomorrow and Kent the day after tomorrow";
$words = preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);
var_dump($words);

Which gets you:

array
  0 => string 'will' (length=4)
  1 => string 'see' (length=3)
  2 => string 'you' (length=3)
  3 => string 'in' (length=2)
  4 => string 'London' (length=6)
  5 => string 'tomorrow' (length=8)
  6 => string 'and' (length=3)
  7 => string 'Kent' (length=4)
  8 => string 'the' (length=3)
  9 => string 'day' (length=3)
  10 => string 'after' (length=5)
  11 => string 'tomorrow' (length=8)


Then, the filtering:

function filter_words($word) {
    // a pretty simple filter ^^ : keep only words of 5 characters or more
    return strlen($word) >= 5;
}
$words_filtered = array_filter($words, 'filter_words');
var_dump($words_filtered);

Which outputs:

array
  4 => string 'London' (length=6)
  5 => string 'tomorrow' (length=8)
  10 => string 'after' (length=5)
  11 => string 'tomorrow' (length=8)


And, finally, the counting:

$counts = array_count_values($words_filtered);
var_dump($counts);

And the final result:

array
  'London' => int 1
  'tomorrow' => int 2
  'after' => int 1


Now, it's up to you to build on this ;-)
Mainly, you'll have to work on:

  • A better splitting function that deals with punctuation (or deal with that during filtering)
  • An "intelligent" filtering function that suits your needs better than mine
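
As a rough sketch of those two improvements, and not the only way to do it: the snippet below splits on anything that is not a letter or an apostrophe, then filters against a stop list instead of a word length. The `$stop_words` list and the punctuated sample string are assumptions made up for this illustration, and the anonymous function needs PHP 5.3 or later.

```php
<?php
// Hypothetical stop-word list; extend it to suit your own text.
$stop_words = array('will', 'see', 'you', 'in', 'and', 'the', 'day', 'after');

// Same sentence as above, with some punctuation added for the demonstration.
$str = "Will see you in London tomorrow, and Kent the day after tomorrow!";

// Split on any run of characters that is neither a letter nor an apostrophe,
// so spaces, commas and exclamation marks all act as separators.
$words = preg_split("/[^\\p{L}']+/u", $str, -1, PREG_SPLIT_NO_EMPTY);

// Keep only the words that are not in the stop list (compared in lower case).
$keywords = array_filter($words, function ($word) use ($stop_words) {
    return !in_array(strtolower($word), $stop_words, true);
});

$counts = array_count_values($keywords);
print_r($counts);
```

With this sample string, `$counts` holds London => 1, tomorrow => 2 and Kent => 1. Whether an apostrophe should count as part of a word is a choice you will have to make for your own data.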

Have fun !

Pascal MARTIN
Too fast for me. Scooped again.
dnagirl
`str_word_count` might also be interesting: http://www.php.net/manual/en/function.str-word-count.php
Felix Kling
Thanks, that works. Is it possible to get the final result without the "int", i.e. just the number on its own?
Steven
@Steven: yes, of course it's possible. Those `int`, `string`, and so on in the output I presented are there because I used `var_dump`, which is great for **inspecting variables** but not so good for **displaying them to users** ;-) It's just a matter of displaying the data with something other than `var_dump`.
Pascal MARTIN
I feel like filtering by word length would be troublesome. It could easily get rid of valid words, e.g. honda, jeep, paris. Which method you should choose really depends on what you're using this for.
Galen
Galen's solution below works well, except when there is an apostrophe. How would I fix that? (Thanks again.)
Steven
Yes, filtering by word length is not a great idea: I just used it as a quick filter for my example. As I implied, in your application you'll have to use something more intelligent ;-) Maybe a white-list *(which you'll spend your time updating)*, or a black-list *(same thing)*, or some statistics calculated on the fly, ... About quotes: you'll first have to determine what constitutes a "word", and then adapt the regex used by `preg_split`.
Pascal MARTIN
I fixed the apostrophe problem by enclosing the string in double quotes rather than single quotes.
Steven
+1  A: 

You could keep a table of common words, then go through your string one word at a time and check whether each word exists in the table. If it does not, add it to your associative array, or add 1 to its count if it is already there.
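
A minimal sketch of that loop might look like the following. The `$common` table is a hypothetical list chosen for this example; storing the words as array keys is just one way to make the lookup cheap with `isset`.

```php
<?php
// Hypothetical table of common words, flipped so the words become keys
// and isset() can be used for the lookup.
$common = array_flip(array('will', 'see', 'you', 'in', 'and', 'the', 'day', 'after'));

$str = "will see you in London tomorrow and Kent the day after tomorrow";

$keywords = array();
foreach (preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY) as $word) {
    if (isset($common[strtolower($word)])) {
        continue;             // common word: skip it
    }
    if (isset($keywords[$word])) {
        $keywords[$word]++;   // already in the array: add 1
    } else {
        $keywords[$word] = 1; // first occurrence
    }
}

print_r($keywords);
```

For the example sentence this leaves London => 1, tomorrow => 2 and Kent => 1 in `$keywords`.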

Francisco Soto
A: 

Using a blacklist of words not to be included:

$str = 'will see you in London tomorrow and Kent the day after tomorrow';
$skip_words = array( 'in', 'the', 'will', 'see', 'and', 'day', 'you', 'after' );
// get words in sentence that aren't to be skipped and count their values
$words = array_count_values( array_diff( explode( ' ', $str ), $skip_words ) );

print_r( $words );
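
One thing to watch: `array_diff` compares strings case-sensitively, so "Will" at the start of a sentence would not be caught by 'will' in the blacklist. A possible variation, shown here only as a sketch with a sample string capitalised for the purpose, is to swap in `array_udiff` with `strcasecmp` as the comparison callback:

```php
<?php
// Sample string with capitalised common words, made up for this illustration.
$str = 'Will see you in London tomorrow And Kent the day after tomorrow';
$skip_words = array('in', 'the', 'will', 'see', 'and', 'day', 'you', 'after');

// array_udiff() works like array_diff() but compares values with a callback;
// strcasecmp makes "Will" match "will" and "And" match "and".
$kept = array_udiff(explode(' ', $str), $skip_words, 'strcasecmp');

$words = array_count_values($kept);
print_r($words);
```

Here the counts come out as 1 for London, 2 for tomorrow and 1 for Kent, even though "Will" and "And" are capitalised.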
Galen