I need to do two things. First, find the most used words and word sequences (up to length n) in a given text. Example:

Lorem *ipsum* dolor sit amet, consectetur adipiscing elit. Nunc auctor urna sed urna mattis nec interdum magna ullamcorper. Donec ut lorem eros, id rhoncus nisl. Praesent sodales lorem vitae sapien volutpat et accumsan lorem viverra. Proin lectus elit, cursus ut feugiat ut, porta sit amet leo. Cras est nisl, aliquet quis lobortis sit amet, viverra non erat. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Integer euismod scelerisque quam, et aliquet nibh dignissim at. Pellentesque ut elit neque. Etiam facilisis nisl eu mauris luctus in consequat libero volutpat. Pellentesque auctor, justo in suscipit mollis, erat justo sollicitudin ipsum, in cursus erat ipsum id turpis. In tincidunt hendrerit scelerisque.

(Some words may have been omitted, but it's just an example.)

I'd like the result to contain sit amet, not sit and amet separately.

Any ideas on how to start?

Second, in a given file, I need to wrap every word or word sequence that matches an entry from a given list.

For this, I'm thinking of ordering the list by descending length and then running each string through the replace function, so that a lone sit in my list doesn't break up an already wrapped sit amet. Is this a good way to do it?
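To make the idea concrete, here is the rough sketch I have in mind ($text, $list and the <em> wrapper are just placeholders):

$text = 'Lorem ipsum dolor sit amet, lorem sit.';
$list = array('sit', 'sit amet', 'lorem');

// Longest entries first, so "sit amet" is replaced before a lone "sit".
usort($list, function ($a, $b) {
    return strlen($b) - strlen($a);
});

foreach ($list as $phrase) {
    $text = preg_replace(
        '/\b' . preg_quote($phrase, '/') . '\b/i',
        '<em>$0</em>',
        $text
    );
}

// Not sure this is enough to stop "sit" from matching again inside
// the already wrapped "sit amet", hence my question.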

Thank you

A: 

I tried to solve the first part some time ago, see here:

http://corexii.com/freqwordseq/

Example on Lorem Ipsum (not yours, but one of 'em):

http://corexii.com/freqwordseq/?file=loremipsum&minfreq=2&minseq=1&maxseq=4

It's pretty slow, but it's a start. What you wanna do is weigh the matches so that the more words in the match, the higher the weight, making sequences more important than the individual words that make them up. And then you probably want to optimize the routine.
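For instance, a weighting pass along these lines (just a sketch; $freq stands in for whatever phrase => count table you end up with):

$freq = array('sit' => 4, 'amet' => 3, 'sit amet' => 3, 'lorem' => 4);

// Multiply each count by the number of words in the phrase, so that
// sequences outrank the individual words they are built from.
$weighted = array();
foreach ($freq as $phrase => $count) {
    $weighted[$phrase] = $count * count(preg_split('/\s+/', $phrase));
}
arsort($weighted);
print_r($weighted);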

Core Xii
Works like a charm, thank you a lot for providing your script.
John
A: 

This is a functional solution that could still use some cleaning up. My general algorithm is this (a small worked example follows the list):

  1. Explode all words into a list w, stripping excess whitespace and punctuation
  2. Find the array of all n-length chunks of w starting at offset 0
  3. Find the array of all n-length chunks of w starting at offset 1
    • ... continue until you've found the array of n-length chunks starting at offset n-1
    • Note: if the last chunk of w is not n-length, do not include it as part of the chunk array
  4. Concatenate all chunk arrays as c
  5. Find the frequency of every value in c
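For instance, with w = (a, b, c, d) and n = 2, the offsets work out like this:

$w = array('a', 'b', 'c', 'd');

// Offset 0: chunks [a b] and [c d]
print_r(array_chunk($w, 2));

// Offset 1: chunks [b c] and [d]; the trailing [d] is only one word
// long, so per the note in step 3 it gets filtered out later
print_r(array_chunk(array_slice($w, 1), 2));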

$sample = 'Lorem *ipsum* dolor sit amet, consectetur adipiscing elit. Nunc auctor urna sed urna mattis nec interdum magna ullamcorper. Donec ut lorem eros, id rhoncus nisl. Praesent sodales lorem vitae sapien volutpat et accumsan lorem viverra. Proin lectus elit, cursus ut feugiat ut, porta sit amet leo. Cras est nisl, aliquet quis lobortis sit amet, viverra non erat. Vestibulum ante ipsum  primis in faucibus orci luctus et ultrices posuere cubilia Curae; Integer euismod scelerisque quam, et aliquet nibh dignissim at. Pellentesque ut elit neque. Etiam facilisis nisl eu mauris luctus in consequat libero volutpat. Pellentesque auctor, justo in suscipit mollis, erat justo sollicitudin ipsum, in cursus erat ipsum id turpis. In tincidunt hendrerit scelerisque.';

function buildPhrases($string, $length) {

    // Strip punctuation, then split on runs of whitespace.
    $onlyWords = preg_replace('/\p{P}/u', '', $string);
    $wordArray = preg_split('/\s+/', $onlyWords);

    // Recursively gather the chunk arrays for offsets 0 .. $length - 1.
    // (A closure, so buildPhrases can safely be called more than once;
    // a named inner function would be redeclared on the second call.)
    $buildPhraseChunks = function ($wordArray, $length, $offset = 0) use (&$buildPhraseChunks) {
        if ($offset >= $length) {
            return array();
        }
        $offsetWordArray = array_slice($wordArray, $offset);
        return array_merge(
            array_chunk($offsetWordArray, $length),
            $buildPhraseChunks($wordArray, $length, $offset + 1)
        );
    };

    // Keeps only chunks that are exactly $n words long.
    $onlyLengthN = function ($n) {
        return function ($a) use ($n) {
            return count($a) == $n;
        };
    };

    // Joins two words with a single space.
    $concatWords = function ($a, $b) {
        return $a . ' ' . $b;
    };

    // Collapses a chunk (an array of words) into one phrase string.
    $reduce = function ($a) use ($concatWords) {
        return array_reduce($a, $concatWords);
    };

    // Normalizes a phrase for counting (array_reduce starts from null,
    // which leaves a leading space; trim removes it).
    $format = function ($a) {
        return strtolower(trim($a));
    };

    $chunks = array_filter(
        $buildPhraseChunks($wordArray, $length),
        $onlyLengthN($length)
    );
    $phrases = array_map($reduce, $chunks);
    $formattedPhrases = array_map($format, $phrases);

    return $formattedPhrases;

}

// Single words for now; see the sketch below for combining several lengths.
$phrases = buildPhrases($sample, 1);

// Drop phrases that occur only once.
$dropOnes = function ($a) {
    return $a != 1;
};
$freqCount = array_filter(
    array_count_values($phrases),
    $dropOnes
);

// Most frequent first.
arsort($freqCount);

print_r($freqCount);
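To get word sequences and not just single words, you can run the same routine for each length up to some maximum and merge the counts (a quick sketch; the maximum of 3 is arbitrary):

$allFreq = array();
for ($len = 1; $len <= 3; $len++) {
    // Count every $len-word phrase and keep those occurring twice or more.
    foreach (array_count_values(buildPhrases($sample, $len)) as $phrase => $n) {
        if ($n > 1) {
            $allFreq[$phrase] = $n;
        }
    }
}
arsort($allFreq);
print_r($allFreq);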
erisco