How can I extract the common words between two or more paragraphs in PHP 5? I guess it might work to summarize each text to create a list of highly ranked words and then compare them. Any suggestions or help would be highly appreciated.

+4  A: 

There is probably a faster way, but you could regex out punctuation like !?-./\@#$%^&*, then explode each paragraph into an array of words, and then try array_intersect() on the two arrays. Any word in array 1 that is also in array 2 should come back as a match.

http://php.net/manual/en/function.array-intersect.php

Theoretically you should receive back an array of matching words. From there, ranking is up to you and how you choose to do it.
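
A minimal sketch of that idea, assuming whitespace-separated English text. The helper name `splitIntoWords` and the sample sentences are made up for illustration, and the `array_unique()` call is an addition so each common word is reported once.

```php
<?php
// Hypothetical helper: lowercase the text, strip punctuation, split into words.
function splitIntoWords($text) {
    // Replace anything that is not a letter, digit or whitespace with a space.
    $clean = preg_replace('/[^a-z0-9\s]+/', ' ', strtolower($text));
    // Split on runs of whitespace and drop empty tokens.
    return preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
}

$a = "The quick brown fox jumps over the lazy dog.";
$b = "A lazy dog sleeps; the fox runs!";

// array_intersect() keeps the values of the first array that also occur in the second.
$common = array_intersect(splitIntoWords($a), splitIntoWords($b));
// Remove duplicates and reindex from 0.
$common = array_values(array_unique($common));

print_r($common); // the, fox, lazy, dog
```

The result follows the word order of the first paragraph, because array_intersect() preserves the keys and order of its first argument.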

Kevin
+1 Beat me to my answer, though I might have used `str_replace` to deal with the punctuation.
Isaac
A: 
  1. Split each paragraph on spaces
  2. Select a token from paragraph A; if it is in paragraph B, put it in a 'matches' array.
  3. Repeat step 2 until there are no more tokens in paragraph A.
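
The three steps above could be sketched as follows. The sample strings are invented, and skipping tokens that are already in the matches array is an assumption added here to avoid duplicates. Punctuation is not stripped, so a token like "blue." would not match "blue".

```php
<?php
$paragraphA = "red green blue green";
$paragraphB = "blue yellow green";

// Step 1: split each paragraph on spaces.
$tokensA = explode(' ', $paragraphA);
$tokensB = explode(' ', $paragraphB);

// Steps 2 and 3: walk through every token of paragraph A.
$matches = array();
foreach ($tokensA as $token) {
    // Keep the token if it occurs in paragraph B and was not recorded before.
    if (in_array($token, $tokensB, true) && !in_array($token, $matches, true)) {
        $matches[] = $token;
    }
}

print_r($matches); // green, blue
```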
Allyn
This seems like an odd answer to have garnered a downvote without comment. Nothing obviously incorrect about the theory.
Beska
I didn't downvote it, but I expect it's because doing it this way seems so clunky. It would work, but using array functions makes way more sense. Also, nothing was mentioned about punctuation, which will cause a problem if it's not stripped.
Syntax Error
+5  A: 

I guess the most basic way would be to:

  • split each paragraph into an array of words, using either explode or preg_split
    • the first one might be a bit faster
    • the second one might provide a bit more options
  • maybe do some filtering on the list of words:
    • clean each word
      • removing special characters, like accented letters
      • converting everything to lower/upper-case, to help the comparisons you'll be doing later
    • remove words that are too common
    • remove words that are too short
    • array_filter, here, could probably help
  • and then, get the list of words that are in both arrays, using something like array_intersect
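
One possible sketch of that pipeline, assuming plain ASCII text. The function name `filteredWords`, the tiny stop-word list and the minimum length of 3 are all placeholders chosen for illustration, and handling of accented letters is left out.

```php
<?php
// Hypothetical stop-word list; a real one would be much longer.
$stopWords = array('the', 'and', 'with', 'this', 'that');

function filteredWords($text, array $stopWords, $minLength) {
    // Lowercase, then split on runs of anything that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    // Drop words that are too short or too common.
    $words = array_filter($words, function ($word) use ($stopWords, $minLength) {
        return strlen($word) >= $minLength && !in_array($word, $stopWords, true);
    });
    // Remove duplicates and reindex from 0.
    return array_values(array_unique($words));
}

$p1 = "The cat sat on the mat with a red hat.";
$p2 = "A red cat and a blue hat, that is it.";

$common = array_values(array_intersect(
    filteredWords($p1, $stopWords, 3),
    filteredWords($p2, $stopWords, 3)
));

print_r($common); // cat, red, hat
```

preg_split is used here rather than explode because it handles punctuation and repeated spaces in one pass, at the cost of a little speed.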
Pascal MARTIN
That's a great method. For the filtering, a more accurate (but more complex) approach is to reduce the weight of each word based on its frequency in a large corpus. For example, the word 'the' has a high frequency, so its ranking will be greatly reduced. The words with the highest rank are then the most representative.
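
A rough sketch of such a weighting, under stated assumptions: the corpus frequencies and word counts below are invented numbers, and dividing the count by the logarithm of the corpus frequency is only one of many possible scoring formulas.

```php
<?php
// Hypothetical corpus frequencies (occurrences per million words).
$corpusFrequency = array('the' => 50000, 'is' => 10000, 'parser' => 5, 'token' => 8);
// Hypothetical counts of the words common to both paragraphs.
$commonCounts = array('the' => 4, 'is' => 2, 'parser' => 2, 'token' => 1);

$scores = array();
foreach ($commonCounts as $word => $count) {
    // Words missing from the corpus table are treated as rare (frequency 1).
    $freq = isset($corpusFrequency[$word]) ? $corpusFrequency[$word] : 1;
    // The more frequent a word is in the corpus, the smaller its score.
    $scores[$word] = $count / log(1 + $freq);
}

// Sort from highest to lowest score, keeping the word keys.
arsort($scores);
print_r(array_keys($scores)); // parser, token, the, is
```

Although 'the' has the highest raw count, it drops below 'parser' and 'token' once its corpus frequency is taken into account.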
mathroc
@mathroc: true; and, with a bit of tweaking, this could also allow one to inject a high weight for some specific words
Pascal MARTIN
Another twist could be to use http://tartarus.org/~martin/PorterStemmer/ as part of this.
chris
I liked your answer, pretty smart.
tawfekov
+2  A: 

Something like this might work...

<?php
  $paragraph = "hello this is some sample text. Sample text is usually used to test a program. For example, this sample text will be used to test the script below.";
  $words = array();
  preg_match_all('/\w+/', $paragraph, $matches);
  foreach($matches[0] as $w){
    $w = strtolower($w);
    if(!array_key_exists($w, $words)){
      $words[$w] = 0;
    }
    $words[$w]++;
  }
  // sort by count, lowest first, keeping the word keys
  asort($words);
  print_r($words);

  /* Output (the order of words with equal counts may vary between PHP versions)
  Array (
      [hello] => 1
      [will] => 1
      [example] => 1
      [a] => 1
      [program] => 1
      [usually] => 1
      [script] => 1
      [below] => 1
      [some] => 1
      [the] => 1
      [be] => 1
      [for] => 1
      [to] => 2
      [is] => 2
      [test] => 2
      [used] => 2
      [this] => 2
      [sample] => 3
      [text] => 3
  ) */

?>
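
The script above counts words in a single paragraph. One way it might be extended to two paragraphs, as a sketch only: count each paragraph with array_count_values() and keep the words whose keys appear in both. The helper name `countWords`, the sample sentences and the choice to add the two counts together are assumptions for illustration.

```php
<?php
// Hypothetical helper: map each lowercase word to its number of occurrences.
function countWords($text) {
    preg_match_all('/\w+/', strtolower($text), $matches);
    // array_count_values() tallies how often each value occurs.
    return array_count_values($matches[0]);
}

$countsA = countWords("Sample text is used to test a script. Sample text is short.");
$countsB = countWords("This script prints sample output as text.");

// Keep only the entries of A whose word also occurs in B.
$shared = array_intersect_key($countsA, $countsB);

// Rank shared words by their combined count in both paragraphs.
foreach ($shared as $word => $count) {
    $shared[$word] = $count + $countsB[$word];
}
arsort($shared);

print_r($shared); // sample => 3, text => 3, script => 2
```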
macek
+2  A: 
<?php
/**
 * Gets all the unique words as an array for a given text blob
 *
 * @param string $paragraph The paragraph in question
 * @return string[] Words found
 */
function getWords($paragraph) {
   // only lowercase, so the comparison is case-insensitive
   $paragraph = strtolower($paragraph);
   // split on runs of non-letter characters (this way periods, digits and
   // line breaks won't end up inside our words); empty pieces are dropped
   $words = preg_split("/[^a-z]+/", $paragraph, -1, PREG_SPLIT_NO_EMPTY);
   // get rid of duplicate words and reindex from 0
   return array_values(array_unique($words));
}

$paragraph1 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque sit amet ante
nisl. Morbi tempor varius semper. Suspendisse vel nisi dui. Sed tristique consectetur imperdiet.
Morbi nulla diam, lobortis non eleifend eget, ullamcorper nec tortor. Duis quis lectus felis.
In vulputate varius luctus. Maecenas gravida laoreet massa quis faucibus. Duis dictum, dui sit
amet pharetra laoreet, tortor nisi mattis tortor, et ornare purus dolor vitae ligula. Sed id
orci ut dolor fermentum imperdiet. Nulla non justo urna, in suscipit nunc. Donec ut nibh risus,
ut tempus mi. Proin fringilla pretium urna sed faucibus. Proin et porttitor sem. Nulla eros
arcu, sodales et aliquam in, pharetra et mauris. Duis placerat blandit justo at tincidunt.
Etiam eu rutrum arcu.";

$paragraph2 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam sit amet leo id
arcu feugiat tempus quis a risus. Proin non nisi augue. Cras ultricies dignissim augue vel gravida.
Vivamus sed orci sed leo sollicitudin aliquet non at dui. Nulla facilisi. Suspendisse nunc nibh,
sollicitudin vitae tincidunt eget, aliquet vitae magna. Aliquam vehicula cursus ante, vitae rhoncus
orci egestas et. Fusce condimentum metus at metus auctor pellentesque. Suspendisse potenti. Morbi
blandit, leo sed eleifend pretium, augue dui interdum eros, vel faucibus felis dolor id elit. Nam
condimentum, odio at mattis consequat, sem eros molestie risus, a tempus dolor arcu sit amet justo.";

$common = array_intersect(getWords($paragraph1), getWords($paragraph2));
sort($common);
var_dump($common);
?>
Luke Magill