
I have created a cron job for my website which runs every 2 hours, counts the words in the feeds, and then displays the 10 highest-count words as the hot topics.

Something like what Twitter does on its homepage, showing the most popular topics being discussed.

What my cron job does right now is count all the words except for the ones I have listed as negligible, words like:

array('of', 'a', 'an', 'also', 'besides', 'equally', 'further', 'furthermore', 'in', 'addition',
      'moreover', 'too', 'after', 'before', 'when', 'while', 'as', 'by', 'the', 'that',
      'since', 'until', 'soon', 'once', 'so', 'whenever', 'every', 'first', 'last', 'because',
      'even', 'though', 'although', 'whereas', 'if', 'unless', 'only', 'whether', 'or', 'not',
      'next', 'likewise', 'however', 'contrary', 'other', 'hand', 'contrast', 'nevertheless',
      'brief', 'summary', 'short', 'for', 'example', 'for instance', 'fact', 'finally',
      'in brief', 'in conclusion', 'in other words', 'in short', 'in summary', 'therefore',
      'accordingly', 'as a result', 'consequently', 'for this reason', 'afterward',
      'in the meantime', 'later', 'meanwhile', 'second', 'earlier', 'still', 'then',
      'third'); // words that are negligible

But this does not completely solve the issue of eliminating all the unwanted words and keeping only the ones that are useful.

Can someone please guide me on this and tell me how I can improve my algorithm?

Regards Zeeshan

+7  A: 

Welcome to the wonderful world of language processing. Basically, features like trending topics are a search for anomalies in language use.

Theoretically, by analyzing the frequency of the words over time, you should be able to filter out the noise (common words, like the ones you listed above). This is not trivial to implement, but definitely a possibility.

Another approach would be to concentrate not on the raw number of words in a specific period of time, but rather on the pattern in which trending topics develop. They usually grow roughly exponentially, and it should be possible to validate the results of your existing search by applying a filter that drops all "hot words" that do not conform to this kind of growth.
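
For instance, a very crude sketch of such a filter (the function name and the growth factor are purely illustrative) could check that a word's count keeps multiplying across recent time windows:

function looksLikeTrend(array $countsPerWindow, $growthFactor = 1.5)
{
    // $countsPerWindow: occurrence counts of one word, oldest window first.
    for ($i = 1; $i < count($countsPerWindow); $i++) {
        if ($countsPerWindow[$i] < $countsPerWindow[$i - 1] * $growthFactor) {
            return false; // growth stalled, so probably not a trending topic
        }
    }
    return true;
}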

Just some thoughts :-)

edit:

To further outline what I meant by filtering by frequency: maybe you should look at dictionaries containing frequency information about words. It's not so hard to build one, and with a solid text corpus (Wikipedia is free to download; I used it for a test) you'll get remarkably good results.

moritz
So, Moritz, do you think Twitter uses something like your second approach to do this?
Zeeshan Rang
I was just speculating on how you might do it, but I have no idea how twitter does it, sorry :-/. But considering the immense amounts of data they deal with, I guess they incorporate a multi-phase approach, with several different filters applied.
moritz
see my answer for how to filter based on overall frequency
James Tauber
+1  A: 

What you are looking for is generally called a stop word list. Here's a blog post with a list of them (even in PHP array format for you) and another text file version.

A bit more googling should find you some other examples.
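
As an aside, once the list grows, a hash lookup beats in_array(); here's a minimal sketch, assuming a stopwords.txt file with one word per line:

$stopWords = array_flip(array_map('trim', file('stopwords.txt')));

if (!isset($stopWords[$word])) {
    // not a stop word, so count it
}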

A few more potential ideas for improving the algorithm in general:

  1. Weight the word usage by recency. You already do this by recalculating every 2 hours, but you could also factor in the exact time since the word was used. So, rather than having each mention of a word be worth 1 "point", its point value would be determined by the time in minutes since the message containing it was posted (see the sketch after this list).

  2. Create a database table of words and their average frequency in messages on your site. When you examine the messages created in the last X hours, compare the word frequency to the average frequency in your database. Those words that have a frequency significantly higher than the average would be considered "hot". Make sure you re-calculate the average frequency for words on a semi-regular basis (once a day maybe?)
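
A minimal sketch of idea 1 (the message structure and the linear decay are my own assumptions; each message is assumed to be an array with 'text' and a 'posted_at' Unix timestamp):

function weightedWordScores(array $messages, $windowSeconds = 7200)
{
    $scores = array();
    $now = time();
    foreach ($messages as $message) {
        $age = $now - $message['posted_at'];
        // Linear decay: a brand-new mention is worth 1 point, one posted
        // $windowSeconds (2 hours) ago is worth 0.
        $weight = max(0, 1 - $age / $windowSeconds);
        foreach (preg_split('/\s+/', strtolower($message['text'])) as $word) {
            $word = trim($word, '.,!?;:"\'()');
            if ($word === '') {
                continue;
            }
            if (!isset($scores[$word])) {
                $scores[$word] = 0;
            }
            $scores[$word] += $weight;
        }
    }
    arsort($scores); // highest score first
    return $scores;
}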

Eric Petroelje
That's a great list, but I don't think that's what he's looking for. He already has a list of stop words (`array(...)`)
brianreavis
@brianreavis - "But this does not completely solves the issue of eliminating all the non-required words. And give only the words that are useful." - seems like he was looking for ways to improve his list of stop words to me.
Eric Petroelje
Yes .. you guys are right..
Zeeshan Rang
Thanks, Eric, for the extra ideas you gave for this issue. I am still looking for a more proven solution, something that is already in use.
Zeeshan Rang
+1  A: 

I don't really know if you're looking for a better way to filter unimportant words or for a way to generate the actual top ten list from a set of valid words.

For filtering, I would suggest a blacklist approach if you want to keep it simple. If you want something more sophisticated, you could build up statistics on which words are used fairly often overall and then filter those from your word list.

For counting, sorting and truncating the actual "trending topic" list, I'd suggest this:

function buildTopTen($word = null)
{
    // Counts accumulate across calls in this static array; pass a word to
    // count it, or call with no argument to fetch the top ten.
    static $wordList = array();
    if (isset($word)) {
        if (!isset($wordList[$word])) {
            $wordList[$word] = 0;
        }
        $wordList[$word]++;
        return $wordList[$word]; // current count of this word
    } else {
        arsort($wordList);                    // sort by count, descending
        return array_slice($wordList, 0, 10); // keep only the ten biggest
    }
}

Just call the function with a word parameter until you're finished. It will return the current count of that word, if that's of any use to you.

Call it once without parameters and it will give you the ten most frequently used words from the words you fed to it.
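
For example, with a hypothetical stream of words:

foreach (array('beer', 'django', 'beer', 'talks', 'beer') as $word) {
    buildTopTen($word); // returns the running count for each word
}
$topTen = buildTopTen(); // no argument: array of the 10 most frequent words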

I tested it and performance seems OK so far.

Of course this is just a suggestion and can be refined much further.

Techpriester
+4  A: 

Here's how we implemented this for the DjangoDose live feed during DjangoCon (note: this is a hack job, we wrote it in one afternoon with no testing, while occasionally yelling "Bifurcation"; as best I can tell, bifurcation has nothing to do with anything). All that being said, it more or less worked for us (meaning that in the evenings, beer was tracked appropriately).

import os
import string
from collections import defaultdict

import simplejson  # or: from django.utils import simplejson
from django.conf import settings
from django.http import HttpResponse

# LOG_DIRECTORY is defined elsewhere in the project.

IGNORED_WORDS = set(open(os.path.join(settings.ROOT_PATH, 'djangocon', 'ignores.txt')).read().split())

def trending_topics(request):
    # Pull the tweet text out of the four most recent log files
    # (one JSON object per line).
    logs = sorted(os.listdir(LOG_DIRECTORY), reverse=True)[:4]
    tweets = []
    for log in logs:
        f = open(os.path.join(LOG_DIRECTORY, log), 'r')
        for line in f:
            tweets.append(simplejson.loads(line)['text'])
    words = defaultdict(int)
    for text in tweets:
        prev = None
        for word in text.split():
            word = word.strip(string.punctuation).lower()
            if word and word not in IGNORED_WORDS:
                words[word] += 1
                if prev is not None:
                    # Credit the two-word phrase and debit both single words,
                    # so that phrases outrank their component words.
                    words['%s %s' % (prev, word)] += 1
                    words[prev] -= 1
                    words[word] -= 1
                prev = word
            else:
                prev = None
    trending = sorted(words.items(), key=lambda o: o[1], reverse=True)[:15]
    if request.user.is_staff:
        trending = ['%s - %s' % (word, count) for word, count in trending]
    else:
        trending = [word for word, count in trending]
    return HttpResponse(simplejson.dumps(trending))
Alex Gaynor
+3  A: 

Here's an idea:

Compute the average usage frequency of every word in the English language. Put these in a lookup table. You will likely want to only keep the most common words. Pick a number of words (say, 5000), or a minimum frequency, that makes sense. You will likely still want to have a list of words that you never show. If you sort your list of words by frequency, it won't take you long to look through them and choose which words to always exclude.

To compute the frequency, you need an input sample. Your choice of input sample will affect the outcome. For example, Twitter could use every twitter message that has ever been posted as its input sample. Topics that are constantly being discussed on Twitter (for example, "Twitter" itself) would lose importance. If you want Twitter-specific topics to keep their significance, then find another input sample.

The algorithm for computing frequency is simple:

  • For each word in the sample:
    • Look that word up in the dictionary
    • Add one to a counter associated with that word
  • To normalize the data, divide the frequency of each word by the number of words in the input sample.

Now, if you run the same algorithm on today's Twitter posts, you can compare today's word frequencies with expected word frequencies.
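
A minimal sketch of that comparison (the function name and threshold are illustrative; both arrays are assumed to map words to normalized frequencies):

function unusuallyFrequent(array $today, array $expected, $minRatio = 2.0)
{
    $hot = array();
    foreach ($today as $word => $frequency) {
        // Words missing from the expected table get a tiny non-zero
        // frequency to avoid division by zero.
        $baseline = isset($expected[$word]) ? $expected[$word] : 1e-6;
        if ($frequency / $baseline >= $minRatio) {
            $hot[$word] = $frequency / $baseline;
        }
    }
    arsort($hot); // most unusual words first
    return $hot;
}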

Neil Whitaker
yep, this is basically the heart of the z-score.
James Tauber
+6  A: 

If you want the statistically significant outliers, you may want to calculate a z-score for each word in a recent subset relative to the overall text.

So if

t is number of occurrences of word in subset
o is number of occurrences of word overall
n_t is number of words in subset
n_o is number of words overall

then calculate:

p_hat = t / n_t
p_0 = o / n_o

z = (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n_t)

The higher the z, the more statistically significant the mention of the word in the subset is relative to the overall text. This can also be used to calculate words that are oddly rare in the subset relative to the overall text.
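
In PHP this might be sketched as follows (variable names mirror the formulas; it assumes the word occurs at least once overall, so p_0 is non-zero):

function zScore($t, $o, $n_t, $n_o)
{
    $p_hat = $t / $n_t; // proportion of the subset that is this word
    $p_0   = $o / $n_o; // proportion of the overall text that is this word
    return ($p_hat - $p_0) / sqrt(($p_0 * (1 - $p_0)) / $n_t);
}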

James Tauber
One could also say that, provided the topic range isn't too broad, a Bayesian filter would work wonders as well. Using your algorithm in combination with that sort of filter, you could build a very good database of what is a "real word" (ham) and what isn't (spam), or simply collect the frequency of words used, period, scoring each word as "spam" every time it is used. This would track word usage and easily weed out stop words. Additionally, the list is dynamic, so you won't have to hunt for stop words again in the future.
Kevin Peno
Thinking out loud on the Bayesian filter some more: if you want to get even more sophisticated, you could treat views of said topics (assuming people view these hot topics), and the words within them, as "ham" in the Bayesian sense. This would give you more than just spam scores and would help in determining which "good words" identify "good topics". Maybe not so much "hot topics", but topics your users are interested in nonetheless.
Kevin Peno
+1  A: 

You might want to research the usage of Markov Chains / Hidden Markov Models.

These mathematical models have been rather successful in Natural Language Processing.

Your accuracy on trending topics would be much higher. (And you can let it learn...)

Sentient
+1  A: 

You might want to check out NLTK (Natural Language Toolkit). There is a free book that will teach you how to use it, available at http://www.nltk.org/book. The only downside is that it's in Python, and I assume you need a PHP solution. Don't be too scared, because the book doesn't expect you to know any Python beforehand.

NLTK is very powerful and definitely worth looking into.

Gattster
+2  A: 

Here's a cheap-and-cheerful way to do it.

Every 2 hours, create a histogram of your word frequencies. Eg;

Snow, 150
Britney, 100
The, 150000

Keep all of these files.

Every so often, grab 50 files from your history and average the frequencies. This flattens out trends that have developed over time. So from these three files;

Snow, 10
Britney, 95
...

Snow, 8
Britney, 100
...

Snow, 12
Britney, 105
...

you get this baseline set;

Snow, 10
Britney, 100

Work out the ratio between this baseline set, and the most recent one;

Snow 1500%
Britney 100%

Your trends are the ones which have the highest ratios. Careful for divide-by-zeros here.
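
A rough sketch of the baseline-and-ratio steps (all names are illustrative; each histogram is assumed to map word => count):

function averageHistograms(array $histograms)
{
    $baseline = array();
    foreach ($histograms as $histogram) {
        foreach ($histogram as $word => $count) {
            if (!isset($baseline[$word])) {
                $baseline[$word] = 0;
            }
            $baseline[$word] += $count;
        }
    }
    foreach ($baseline as $word => $total) {
        $baseline[$word] = $total / count($histograms);
    }
    return $baseline;
}

function trendRatios(array $recent, array $baseline)
{
    $ratios = array();
    foreach ($recent as $word => $count) {
        // Guard against divide-by-zero for words absent from the baseline.
        $base = isset($baseline[$word]) && $baseline[$word] > 0 ? $baseline[$word] : 1;
        $ratios[$word] = 100 * $count / $base; // as a percentage
    }
    arsort($ratios); // highest ratios first, i.e. the trends
    return $ratios;
}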

What's nice about this is that you tune your trends by picking data from a longer or shorter time period. Eg, you can see this month's trends by averaging over the year, and this afternoon's trends by averaging over the last 24 hours.

Edit -- with this algorithm, you don't need to worry about stop words, because they'll all have relatively stable ratios, at about 100%, so they will always be boring (i.e. off-trend)

Steve Cooper
A: 

Is it not easier to scan each feed entry at creation time rather than doing a big, massive search every 2 hours?
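
For what it's worth, a hypothetical sketch of that incremental approach (it assumes MySQL via PDO and a word_counts table with a unique key on (word, bucket); all names are made up):

function recordEntry(PDO $db, $text)
{
    $bucket = (int) floor(time() / 7200); // id of the current 2-hour window
    $stmt = $db->prepare(
        'INSERT INTO word_counts (word, bucket, count) VALUES (?, ?, 1)
         ON DUPLICATE KEY UPDATE count = count + 1'
    );
    foreach (preg_split('/\s+/', strtolower($text)) as $word) {
        $word = trim($word, '.,!?;:"\'()');
        if ($word !== '') {
            $stmt->execute(array($word, $bucket));
        }
    }
}

The hot-topics job then only has to read the counters for the latest bucket.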

Wolax