How can I extract the common words between two or more paragraphs in PHP 5? I guess it might work to summarize each text to create a list of highly ranked words and then compare them. Any suggestions or help would be highly appreciated.
+4
A:
There is probably a faster way, but you could use a regex to strip out punctuation like !?-./\@#$%^&*, then explode each of the two paragraphs into an array of words, and then try array_intersect() on both arrays. Any word in array 1 that is also in array 2 should come back as a match.
http://php.net/manual/en/function.array-intersect.php
Theoretically you should receive back an array of matching words. From there, ranking is up to you and how you choose to do it.
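A minimal sketch of this approach, assuming ASCII text and case-insensitive matching. The helper name `wordsOf` is made up for illustration, and `preg_split` stands in for `explode` so that repeated spaces do not produce empty entries.

```php
<?php
// Hypothetical helper: lowercase the text, strip punctuation, split on whitespace.
function wordsOf($text) {
    $text = strtolower($text);
    // Replace anything that is not a letter, digit, or whitespace with a space.
    $text = preg_replace('/[^a-z0-9\s]+/', ' ', $text);
    // PREG_SPLIT_NO_EMPTY drops the empty strings left by repeated spaces.
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$a = wordsOf("The quick brown fox jumps over the lazy dog!");
$b = wordsOf("A lazy dog sleeps; the fox does not.");

// array_intersect() keeps the values of $a that also appear in $b;
// array_unique() collapses repeats and array_values() reindexes from 0.
$common = array_values(array_unique(array_intersect($a, $b)));
print_r($common);
```

The result keeps the word order of the first paragraph, so swapping the arguments changes the order but not the set of matches.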
Kevin
2010-03-22 17:16:00
+1 Beat me to my answer, though I might have used `str_replace` to deal with the punctuation.
Isaac
2010-03-22 17:18:44
A:
- Split each paragraph on spaces
- Select a token from paragraph A; if it is in paragraph B, put it in a 'matches' array.
- Repeat step 2 until there are no more tokens in paragraph A.
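The steps above could be sketched roughly as follows. This assumes punctuation has already been stripped and the text is already in one case; the variable names are illustrative only.

```php
<?php
$paragraphA = "red green blue green";
$paragraphB = "blue yellow green";

// Step 1: split each paragraph on spaces.
$tokensA = explode(' ', $paragraphA);
$tokensB = explode(' ', $paragraphB);

// Flip B so each word becomes a key; isset() is then a cheap lookup
// compared with scanning B once per token.
$lookupB = array_flip($tokensB);

$matches = array();
// Steps 2 and 3: test every token of A against B.
foreach ($tokensA as $token) {
    if (isset($lookupB[$token])) {
        // Using the token as the key avoids storing duplicates.
        $matches[$token] = true;
    }
}
$matches = array_keys($matches);
print_r($matches);
```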
Allyn
2010-03-22 17:16:19
This seems like an odd answer to have garnered a downvote without comment. Nothing obviously incorrect about the theory.
Beska
2010-03-22 17:20:12
I didn't downvote it, but I expect it's because doing it this way seems so clunky. It would work, but using array functions makes way more sense. Also, nothing was mentioned about punctuation, which will cause a problem if it's not stripped.
Syntax Error
2010-03-22 17:55:31
+5
A:
I guess the most basic way would be to:
- split each paragraph into an array of words, using either explode or preg_split
  - the first one might be a bit faster
  - the second one might provide a bit more options
- maybe do some filtering on the list of words (array_filter could probably help here):
  - clean each word
    - removing special characters, like accented letters
    - converting everything to lower/upper-case, to help the comparisons you'll be doing later
  - remove too-common words
  - remove too-short words
- and then get the list of words that are in both arrays, using something like array_intersect
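One possible shape for that pipeline, as a rough sketch. The stop-word list, the minimum length, and the helper name `cleanWords` are all assumptions chosen for the example. It handles ASCII letters only (accented characters are simply treated as separators), and the closure requires PHP 5.3 or later.

```php
<?php
// Assumed stop-word list and minimum length; tune both for real text.
$stopWords = array('the', 'and', 'with', 'that', 'this');
$minLength = 3;

// Hypothetical helper: split, normalise, and filter one paragraph.
function cleanWords($text, array $stopWords, $minLength) {
    // Lowercase, then split on any run of characters that are not a-z.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    // Drop words that are too short or too common.
    $kept = array_filter($words, function ($word) use ($stopWords, $minLength) {
        return strlen($word) >= $minLength && !in_array($word, $stopWords, true);
    });
    return array_values(array_unique($kept));
}

$a = cleanWords("The cat sat on the mat with a hat.", $stopWords, $minLength);
$b = cleanWords("A hat and a cat: that is all.", $stopWords, $minLength);

// Words present in both cleaned lists, in the order they appear in $a.
$common = array_values(array_intersect($a, $b));
print_r($common);
```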
Pascal MARTIN
2010-03-22 17:16:56
That's a great method. For the filtering, a more accurate (but more complex) approach is to reduce the weight of words based on their frequencies in a large corpus. For example, the word 'the' has a high frequency, so its ranking will be greatly reduced. The words with the highest rank are then the most representative.
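A small sketch of that weighting idea, similar in spirit to TF-IDF. The corpus frequencies below are invented numbers, the fallback value for unknown words is an assumption, and the scoring formula (count times the log of the inverse frequency) is just one plausible choice.

```php
<?php
// Hypothetical relative frequencies from a large reference corpus (made-up numbers).
$corpusFrequency = array('the' => 0.05, 'is' => 0.01, 'database' => 0.0001, 'index' => 0.0002);
// Assumed fallback for words missing from the table: treat them as rare.
$defaultFrequency = 0.0001;

// Words common to both paragraphs, with how often each occurs overall.
$commonCounts = array('the' => 4, 'is' => 3, 'database' => 2, 'index' => 2);

$scores = array();
foreach ($commonCounts as $word => $count) {
    $freq = isset($corpusFrequency[$word]) ? $corpusFrequency[$word] : $defaultFrequency;
    // Rare corpus words get a large weight, frequent ones a small weight.
    $scores[$word] = $count * log(1 / $freq);
}

arsort($scores); // highest score first, keys preserved
print_r($scores);
```

With these numbers 'the' ends up last even though it has the highest raw count, which is the effect described above.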
mathroc
2010-03-22 17:21:59
@mathroc : true ; and, with a bit of tweaking, this could also allow one to inject high weight for some specific words
Pascal MARTIN
2010-03-22 17:51:32
Another twist could be to use http://tartarus.org/~martin/PorterStemmer/ as part of this.
chris
2010-03-22 19:12:09
+2
A:
Something like this might work...
<?php
$paragraph = "hello this is some sample text. Sample text is usually used to test a program. For example, this sample text will be used to test the script below.";
$words = array();
preg_match_all('/\w+/', $paragraph, $matches);
foreach ($matches[0] as $w) {
    $w = strtolower($w);
    if (!array_key_exists($w, $words)) {
        $words[$w] = 0;
    }
    $words[$w]++;
}
asort($words);
echo print_r($words, true);
/* Output (the order of words with equal counts may vary)
Array (
[hello] => 1
[will] => 1
[example] => 1
[a] => 1
[program] => 1
[usually] => 1
[script] => 1
[below] => 1
[some] => 1
[the] => 1
[be] => 1
[for] => 1
[to] => 2
[is] => 2
[test] => 2
[used] => 2
[this] => 2
[sample] => 3
[text] => 3
) */
?>
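The counting loop above works on a single paragraph. To get back to the original question, one could build a count table per paragraph and keep only the words that occur in both. The sketch below assumes that summing the two counts is an acceptable ranking; the helper name `countWords` is made up.

```php
<?php
// Same counting idea as above, wrapped in a hypothetical helper.
function countWords($text) {
    preg_match_all('/\w+/', strtolower($text), $matches);
    // array_count_values() builds the word => count table in one call.
    return array_count_values($matches[0]);
}

$countsA = countWords("Sample text is used to test a program.");
$countsB = countWords("This sample text will test the script. Sample text again.");

// Keep only the entries of $countsA whose word is also a key in $countsB.
$shared = array_intersect_key($countsA, $countsB);

// Rank each shared word by its combined count (an assumed scoring rule).
$ranked = array();
foreach ($shared as $word => $countA) {
    $ranked[$word] = $countA + $countsB[$word];
}
arsort($ranked); // highest combined count first
print_r($ranked);
```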
macek
2010-03-22 17:17:35
+2
A:
<?php
/**
 * Gets all the words as an array for a given text blob
 *
 * @param string $paragraph The paragraph in question
 * @return string[] Words found
 */
function getWords($paragraph) {
    // only lowercase
    $paragraph = strtolower($paragraph);
    // replace every character that is not a lowercase letter (punctuation, digits,
    // newlines) with a space, so periods won't get stuck to our words
    $paragraph = preg_replace("/[^a-z]/", " ", $paragraph);
    $paragraph = explode(" ", $paragraph);
    // get rid of empty words (flipping twice also removes duplicates)
    $paragraph = array_flip($paragraph);
    unset($paragraph[""]);
    $paragraph = array_flip($paragraph);
    return $paragraph;
}
$paragraph1 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque sit amet ante
nisl. Morbi tempor varius semper. Suspendisse vel nisi dui. Sed tristique consectetur imperdiet.
Morbi nulla diam, lobortis non eleifend eget, ullamcorper nec tortor. Duis quis lectus felis.
In vulputate varius luctus. Maecenas gravida laoreet massa quis faucibus. Duis dictum, dui sit
amet pharetra laoreet, tortor nisi mattis tortor, et ornare purus dolor vitae ligula. Sed id
orci ut dolor fermentum imperdiet. Nulla non justo urna, in suscipit nunc. Donec ut nibh risus,
ut tempus mi. Proin fringilla pretium urna sed faucibus. Proin et porttitor sem. Nulla eros
arcu, sodales et aliquam in, pharetra et mauris. Duis placerat blandit justo at tincidunt.
Etiam eu rutrum arcu.";
$paragraph2 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam sit amet leo id
arcu feugiat tempus quis a risus. Proin non nisi augue. Cras ultricies dignissim augue vel gravida.
Vivamus sed orci sed leo sollicitudin aliquet non at dui. Nulla facilisi. Suspendisse nunc nibh,
sollicitudin vitae tincidunt eget, aliquet vitae magna. Aliquam vehicula cursus ante, vitae rhoncus
orci egestas et. Fusce condimentum metus at metus auctor pellentesque. Suspendisse potenti. Morbi
blandit, leo sed eleifend pretium, augue dui interdum eros, vel faucibus felis dolor id elit. Nam
condimentum, odio at mattis consequat, sem eros molestie risus, a tempus dolor arcu sit amet justo.";
$common = array_intersect(getWords($paragraph1), getWords($paragraph2));
sort($common);
var_dump($common);
?>
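Since the question mentions two or more paragraphs, the same idea might be extended along these lines. This is only a sketch: the sample sentences are invented, and `str_word_count` is used here in place of `getWords` to keep the example short. Note that `str_word_count` also treats apostrophes and hyphens as part of a word.

```php
<?php
$paragraphs = array(
    "Apples and pears are fruit.",
    "Pears, plums and apples grow on trees.",
    "I like apples; pears less so.",
);

$wordLists = array();
foreach ($paragraphs as $text) {
    // With format 1, str_word_count() returns the words as an array.
    $wordLists[] = array_unique(str_word_count(strtolower($text), 1));
}

// array_intersect() accepts any number of arrays, so pass them all at once.
$common = array_values(call_user_func_array('array_intersect', $wordLists));
sort($common);
print_r($common);
```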
Luke Magill
2010-03-22 17:42:42