Let's say I'm collecting tweets from Twitter based on a variety of criteria and storing these tweets in a local MySQL database. I want to be able to compute trending topics, as Twitter does, that can be anywhere from 1-3 words in length.

Is it possible to write a script to do something like this in PHP and MySQL?

I've found answers on how to compute which terms are "hot" once you're able to get counts of the terms, but I'm stuck at the first part. How should I store the data in the database, and how can I count the frequency of terms that are 1-3 words in length?

+1  A: 

How about decomposing your tweets into single-word tokens first and calculating, for every word, its number of occurrences? Once you have those, you could decompose into all two-word tokens, calculate the number of occurrences, and finally do the same with all three-word tokens.

You might also want to add some kind of dictionary of words you don't want to count (a stop-word list).
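
A minimal sketch of that n-gram counting in PHP with PDO (the `tweets` table, `tweet_text` column, `ngram_counts` table, and connection details are assumptions for illustration, not from the question):

    <?php
    // Count 1-3 word n-grams across all stored tweets.
    $pdo = new PDO('mysql:host=localhost;dbname=tweets_db;charset=utf8mb4', 'user', 'pass');

    $stopWords = array('the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'it');
    $counts = array();

    foreach ($pdo->query('SELECT tweet_text FROM tweets') as $row) {
        // Lowercase, strip punctuation (keep # and @), split on whitespace.
        $clean = preg_replace('/[^a-z0-9#@\s]/', ' ', strtolower($row['tweet_text']));
        $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);

        for ($n = 1; $n <= 3; $n++) {
            for ($i = 0; $i + $n <= count($words); $i++) {
                $gram = array_slice($words, $i, $n);
                // Skip single words that are stop words.
                if ($n === 1 && in_array($gram[0], $stopWords, true)) {
                    continue;
                }
                $key = implode(' ', $gram);
                $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
            }
        }
    }

    // Persist counts so the "hot"-term calculation can run later.
    // Assumes ngram_counts(ngram, n, cnt) with a UNIQUE key on ngram.
    $insert = $pdo->prepare(
        'INSERT INTO ngram_counts (ngram, n, cnt) VALUES (?, ?, ?)
         ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt)'
    );
    foreach ($counts as $ngram => $cnt) {
        $insert->execute(array($ngram, substr_count($ngram, ' ') + 1, $cnt));
    }

In practice you would run this incrementally (only over tweets collected since the last run) and bucket the counts by time window, so that popularity over time can be compared.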

Dominik
Do you have any suggestions for doing this efficiently? This seems like a pretty good idea.
Brian
A: 

What you need is either

  1. document classification, or
  2. automatic tagging

Probably the second one. Only then can you count their popularity over time.

Artjom Kurapov
A: 

Or do the opposite of Dominik's approach and store a set list of phrases you wish to match, spaces and all. Write them as regex strings. For each row in the database (file, SQL table, whatever), run the regex and count the matches (see the sketch at the end of this answer).

It depends on which way around you want to do it: count everything, subtract what is common, and thereby find what is truly trending, or look up a fixed set of phrases. In the first case you'll find a lot that might not interest you and will need an extensive blocklist; in the second, you'll need a huge whitelist.

To go beyond that, you need natural language processing tools to determine the meaning of what is said.
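
A sketch of the phrase-lookup approach in PHP (the phrase list, table name, and connection details are placeholders for illustration):

    <?php
    // Count occurrences of a fixed list of phrases across stored tweets.
    $pdo = new PDO('mysql:host=localhost;dbname=tweets_db;charset=utf8mb4', 'user', 'pass');

    $phrases = array('world cup', 'new iphone', 'breaking news');

    // Build one case-insensitive, word-bounded regex per phrase.
    $patterns = array();
    foreach ($phrases as $phrase) {
        $patterns[$phrase] = '/\b' . preg_quote($phrase, '/') . '\b/i';
    }

    $counts = array_fill_keys($phrases, 0);
    foreach ($pdo->query('SELECT tweet_text FROM tweets') as $row) {
        foreach ($patterns as $phrase => $pattern) {
            // preg_match_all returns the number of matches in this tweet.
            $counts[$phrase] += preg_match_all($pattern, $row['tweet_text'], $m);
        }
    }

    arsort($counts);   // most frequent phrase first
    print_r($counts);

This is simple and fast for a small whitelist; the cost grows with the number of phrases, since every pattern is applied to every row.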

Ninefingers