ansaurus

It's pretty slow, but it's a start. What you wanna do is weigh the matches so that the more words in the match, the higher weight, to make sequences more important than the individual words that make up those sequences. And then you probably want to optimize the routine.

Core Xii 2010-10-12 16:16:51

Works like a charm, thank you a lot for providing your script.

John 2010-10-12 17:12:51

Answer 3

A:

This is a functional solution that could still use some cleaning up. My general algorithm is this:

Explode all words into a list w, stripping excess whitespace and punctuation
Find the array of all n-length chunks of w starting at offset 0
Find the array of all n-length chunks of w starting at offset 1
- ... continue until you've found the array of n-length chunks starting at offset n-1
- Note: if the last chunk of w is not n-length, do not include it as part of the chunk array
Concatenate all chunk arrays as c
Find the frequency of every value in c

$sample = 'Lorem *ipsum* dolor sit amet, consectetur adipiscing elit. Nunc auctor urna sed urna mattis nec interdum magna ullamcorper. Donec ut lorem eros, id rhoncus nisl. Praesent sodales lorem vitae sapien volutpat et accumsan lorem viverra. Proin lectus elit, cursus ut feugiat ut, porta sit amet leo. Cras est nisl, aliquet quis lobortis sit amet, viverra non erat. Vestibulum ante ipsum  primis in faucibus orci luctus et ultrices posuere cubilia Curae; Integer euismod scelerisque quam, et aliquet nibh dignissim at. Pellentesque ut elit neque. Etiam facilisis nisl eu mauris luctus in consequat libero volutpat. Pellentesque auctor, justo in suscipit mollis, erat justo sollicitudin ipsum, in cursus erat ipsum id turpis. In tincidunt hendrerit scelerisque.';

function buildPhrases($string, $length) {

    $onlyWords = preg_replace('/\p{P}/', '', $string);
    $wordArray = preg_split('/\s+/s', $onlyWords);

    function buildPhraseChunks($wordArray, $length, $offset = 0)    
    {
        if ($offset >= $length) {
            return array();
        } else {
            $offsetWordArray = array_slice($wordArray, $offset);
            return array_merge(
                array_chunk($offsetWordArray, $length),             
                buildPhraseChunks(
                    $wordArray, $length, $offset + 1
                )
            );
        }
    }

    $onlyLengthN = function ($n) {
        return function($a) use ($n) {
            return count($a) == $n;
        };
    };

    $concatWords = function ($a, $b) {
        return $a . ' ' . $b;
    };

    $reduce = function ($a) use ($concatWords) {
        return array_reduce($a, $concatWords);
    };

    $format = function ($a) {
        return strtolower(trim($a));
    };

    $chunks = array_filter(
        buildPhraseChunks($wordArray, $length),
        $onlyLengthN($length)
    );
    $phrases = array_map($reduce, $chunks);
    $formattedPhrases = array_map($format, $phrases);

    return $formattedPhrases;

}

$phrases = buildPhrases($sample, 1);
$dropOnes = function($a) {
    return $a != 1;
};
$freqCount = array_filter(
    array_count_values($phrases),
    $dropOnes
);

arsort($freqCount);

print_r($freqCount);

erisco 2010-10-12 16:43:44

ansaurus

tags:

views:

answers:

How to replace and count frequency of a word or word sequence?

related questions