views: 158
answers: 2
Hi,

I want to know effective algorithms/data structures for extracting the information below from streaming data.

Consider a real-time data stream like Twitter's. I am mainly interested in the queries below rather than in storing the actual data.

I need my queries to run on the actual data, but not on any of the duplicates.

Since I am not interested in storing the complete data, it will be difficult for me to identify duplicate posts. I could hash every post and check new posts against those hashes, but I would also like to identify near-duplicate posts. How can I achieve this?

Identify the top k topics being discussed by the users.

I want to identify the top topics being discussed by users. I don't want the highest-frequency words, as Twitter shows them; instead, I want to assign a high-level topic name to the most frequent words.

I would like my system to be real-time; that is, it should be able to handle any amount of traffic.

I can think of a MapReduce approach, but I am not sure how to handle synchronization issues. For example, duplicate posts can reach different nodes, and both nodes could store them in the index.

In a typical news source, one would remove stop words from the data. In my system, I would like to update my stop-word list by identifying the most frequent words across a wide range of topics.
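For instance, a naive version of this idea might look like the following sketch (toy data and an arbitrary threshold, just to make the intent concrete):

    from collections import Counter, defaultdict

    # Per-topic word counts; a word frequent across many topics is a
    # stop-word candidate.
    topic_counts = defaultdict(Counter)

    def observe(topic, words):
        topic_counts[topic].update(words)

    def stop_word_candidates(min_topics=3):
        # Count how many distinct topics each word appears in.
        seen_in = Counter()
        for counts in topic_counts.values():
            seen_in.update(counts.keys())
        return {w for w, n in seen_in.items() if n >= min_topics}

    observe("sports", ["the", "match", "was", "great"])
    observe("movies", ["the", "film", "was", "great"])
    observe("politics", ["the", "debate", "was", "heated"])
    print(stop_word_candidates())  # {'the', 'was'}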

What would be an effective algorithm or data structure to achieve this?

I would like to store the topics over a period of time to extract interesting patterns from the data; say, on Friday evenings everyone wants to go to a movie. What would be an efficient way to store this data?

I am thinking of storing it in the Hadoop Distributed File System, but over time these indexes will become so large that I/O becomes my major bottleneck.

Consider multi-lingual data from tweets around the world. How can I identify similar topics being discussed across a geographical area?

There are two problems here. One is identifying the language being used. It could be inferred from who is tweeting, but that information might affect users' privacy. Another idea would be to run the text through a trained classifier; what is the best method currently used for this? The other problem is looking up each word in a dictionary and mapping it to a common intermediate language, say English. How do I handle word-sense disambiguation, i.e., the same word being used in different contexts?

Identify the word boundaries

One possibility is to use some kind of training algorithm, but what is the best approach in practice? This is somewhat similar to word-sense disambiguation, in that word boundaries can be identified from the context of the actual sentence.

I am planning to develop a prototype and evaluate the system rather than build a full implementation. I don't think it's possible to scrape Twitter's real-time data, so I am hoping this approach can be tested on data that is freely available online. Any ideas where I can get such data?

Your feedback is appreciated.

Thanks for your time.

-- Bala

A: 

I'm not really sure if I'm answering your main question, but you could determine the similarity of two messages by calculating the Levenshtein distance between them. You can think of this as the "edit distance" between two strings, i.e., how many single-character edits would be needed to convert one into the other.
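For concreteness, here is a minimal sketch of the standard dynamic-programming computation in Python (illustrative only, not tuned for streaming use):

    def levenshtein(a, b):
        """Edit distance: minimum insertions, deletions, and substitutions."""
        # Classic dynamic programming, keeping only the previous row.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3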

dicroce
Yeah, this distance measure can be used to find near-duplicates of a tweet, but I don't want to store the tweets, as that would be too much data. Is there a way to do this on streaming data?
Algorist
It would be difficult to scale this to large numbers of messages (although since the Levenshtein distance follows the triangle inequality, you can use a VP tree). It would also require storing all the messages processed.
Dietrich Epp
+1  A: 

There are a couple of different questions buried in here. I can't address everything you're asking, but here's the big one as I understand it: you want to categorize messages by topic, and you also want to remove duplicates.

Removing duplicates is (relatively) easy. To catch "near" duplicates, first strip the uninteresting parts from your data: remove capitalization and punctuation, and drop the most common words. Then add the resulting normalized message to a Bloom filter. Storing an exact hash of every message isn't good enough at Twitter's scale: the table of hashes keeps growing and would eventually outgrow memory. That's why you'd use a Bloom filter instead. It might have to be a very large Bloom filter, but it will still be far smaller than a hash table.
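A rough sketch of that normalize-then-filter pipeline in Python (the stop-word list, filter size, and hash count below are illustrative assumptions, not tuned values):

    import hashlib
    import re

    STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "in"}

    def normalize(text):
        # Lowercase, strip punctuation, drop the most common words.
        words = re.findall(r"[a-z0-9]+", text.lower())
        return " ".join(w for w in words if w not in STOP_WORDS)

    class BloomFilter:
        def __init__(self, num_bits=1 << 24, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, item):
            # Derive several bit positions from salted SHA-256 digests.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    seen = BloomFilter()
    for tweet in ["Going to the movies tonight!", "going to the movies tonight"]:
        key = normalize(tweet)
        if key in seen:
            print("duplicate (or false positive):", tweet)
        else:
            seen.add(key)

Note that a Bloom filter can report false positives but never false negatives, so a few unique messages may occasionally be discarded as "duplicates"; for this application that trade-off is usually acceptable.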

The other part is a difficult categorization problem. You probably do not want to write this part yourself. There are a number of libraries and programs available for categorization, but it might be hard to find one that fits your needs. An example is the Vowpal Wabbit project, which is a fast online algorithm for categorization. However, it only works on one category at a time. For multiple categories, you would have to run multiple copies and train them separately.
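To illustrate the one-binary-classifier-per-category idea, here is a generic sketch with a toy online perceptron standing in for the binary learner (this is not Vowpal Wabbit's actual interface, just the shape of the approach):

    from collections import defaultdict

    class BinaryLearner:
        """Toy online learner: a perceptron over bag-of-words features."""
        def __init__(self):
            self.weights = defaultdict(float)

        def predict_score(self, features):
            return sum(self.weights[f] for f in features)

        def update(self, features, label):  # label is +1 or -1
            # Only adjust weights on a mistake (or a zero-margin prediction).
            if label * self.predict_score(features) <= 0:
                for f in features:
                    self.weights[f] += label

    class OneVsRest:
        """One binary classifier per category; pick the highest scorer."""
        def __init__(self, categories):
            self.learners = {c: BinaryLearner() for c in categories}

        def train(self, features, true_category):
            for c, learner in self.learners.items():
                learner.update(features, +1 if c == true_category else -1)

        def classify(self, features):
            return max(self.learners,
                       key=lambda c: self.learners[c].predict_score(features))

    clf = OneVsRest(["sports", "movies", "politics"])
    clf.train({"goal", "match", "score"}, "sports")
    clf.train({"film", "actor", "movie"}, "movies")
    clf.train({"vote", "election"}, "politics")
    print(clf.classify({"film", "actor"}))  # movies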

Identifying the language sounds less difficult. Don't try to do something smart like "training"; instead, put the most common words from each language in a dictionary, and for each message pick the language whose words appear most frequently.
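That approach fits in a few lines of Python (the word lists here are tiny illustrative stand-ins, not real linguistic data):

    # Most common words per language; real lists would be much longer.
    COMMON_WORDS = {
        "english": {"the", "and", "is", "to", "of"},
        "spanish": {"el", "la", "y", "de", "que"},
        "german":  {"der", "die", "und", "ist", "das"},
    }

    def guess_language(text):
        words = text.lower().split()
        # Pick the language with the most common-word hits.
        return max(COMMON_WORDS,
                   key=lambda lang: sum(w in COMMON_WORDS[lang] for w in words))

    print(guess_language("la casa de la playa"))    # spanish
    print(guess_language("the cat is on the mat"))  # english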

If you want the algorithm to come up with categories on its own, good luck.

Dietrich Epp