I am considering a project in which a publication's content is augmented by relevant, publicly available tweets from people in the area. But how could I programmatically find the relevant tweets? I know that generating a structure representing the meaning of natural language is pretty much the holy grail of NLP, but perhaps there's some tool I can use to at least narrow it down a bit?

Alternatively, I could just use hashtags. But that requires more work on behalf of the users. I'm not super familiar with Twitter - do most people use hashtags (even for smaller scale issues), or would relying on them cut off a large segment of data?

I'd also be interested in grabbing Facebook statuses (with permission from the poster, of course), and hashtag use is pretty rare on Facebook.

I could use simple keyword search to crudely narrow the field, but that's more likely to require human intervention to determine which tweets should actually be posted alongside the content.

Ideas? Has this been done before?

A: 

Hey,

Great question. I think for Twitter your best bet is to use hashtags. Otherwise you need to write, or find, algorithms that do language analysis and improve over time based on user feedback.

For Facebook you can do something like what Bing implemented a while back, which I covered in this article: http://www.socialtimes.com/2010/06/bing-adds-facebook-and-twitter-features-steps-up-social-services/

I wrote: For example, a search for “NBA Finals” will return fan-page content from Facebook, including posts from a local TV station. So if you're trying to augment NBA-related content, you could run a search similar to the one Bing provides, going through publicly available fan-page content the way spiders index it for search engines. I'm not a developer, so I'm not sure of the intricacies, but I know it can be done.

Also, for all non-fan-page content, popular shared links from users who are publishing to ‘everyone’ are aggregated, and you can display those. I'm not sure whether this is limited to links that are published to 'everyone' and/or are 'popular', although I would assume so, but you can double-check that.

Hope this helps

Azam Khan
A: 

The problem with NLP is not the algorithm (although that is an issue); the problem is the resources. There are some open source shallow parsing tools (that's all you would need to get intent) that you could use, but parsing thousands or millions of tweets would cost a fortune in computer time.

On the other hand, like you said, not all tweets have hashtags, and there is no guarantee they will be relevant.

Maybe you can use a mixture: a keyword search to narrow the field to a few candidates (those with the highest keyword density), then a deeper analysis to pick the top 1 or 2. This would keep computing resources to a minimum, and you should still be able to get relevant tweets.
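To make the first stage concrete, here is a rough sketch of a keyword-density pre-filter in plain Python. The keyword set, the sample tweets, and the function names (tokenize, keyword_density, prefilter) are all made up for illustration, and density is assumed to mean "matching tokens divided by total tokens". The more expensive analysis would then run only on the handful of tweets this returns.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def keyword_density(text, keywords):
    """Fraction of the tokens in `text` that appear in the `keywords` set."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in keywords)
    return hits / len(tokens)

def prefilter(tweets, keywords, keep=2):
    """Return up to `keep` tweets with the highest non-zero keyword density."""
    scored = [(keyword_density(t, keywords), t) for t in tweets]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tweet for _, tweet in scored[:keep]]

# Hypothetical keywords taken from the article being augmented.
keywords = {"nba", "finals", "lakers", "celtics"}
tweets = [
    "eating lunch",
    "NBA finals rocked",
    "Watching the Lakers and Celtics in the NBA finals tonight",
    "Googlers now allowed to use Ruby!",
]
candidates = prefilter(tweets, keywords, keep=2)
```

Note that density favors short tweets, so a three-word tweet with two keywords outranks a longer, more informative one. That is one reason a second, deeper pass is still worth having.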

Sruly
+2  A: 

There are two straightforward ways to go about finding tweets relevant to your content. The first would be to treat this as a supervised document classification task, whereby you would train a classifier to annotate tweets with a certain predetermined set of topic labels. You could then use the labels to select tweets that are appropriate for whatever content you'll be augmenting. If you don't like using a predetermined set of topics, another approach would be to simply score tweets according to their semantic overlap with your content. You could then display the top n tweets with the most semantic overlap.

Supervised Document Classification

Using supervised document classification would require that you have a training set of tweets labeled with the set of topics you'll be using, e.g.:

tweet: NBA finals rocked                    label: sports
tweet: Googlers now allowed to use Ruby!    label: programming
tweet: eating lunch                         label: other
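To show what a classifier does with labeled tweets like these, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written in plain Python. It is only an illustration of the idea, not the method of any particular package; the NaiveBayes class is hypothetical, and the three extra training tweets are invented so that each label has more than one example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9#']+", text.lower())

class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of tweets
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.label_counts.items():
            # Work in log space to avoid underflow on longer texts.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                # Add-one smoothing so unseen words do not zero out a label.
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

training = [
    ("NBA finals rocked", "sports"),
    ("Googlers now allowed to use Ruby!", "programming"),
    ("eating lunch", "other"),
    # The next three examples are invented for this sketch.
    ("great game in the NBA playoffs", "sports"),
    ("debugging Ruby code all night", "programming"),
    ("lunch was boring today", "other"),
]

clf = NaiveBayes()
clf.train(training)
sports_guess = clf.classify("NBA playoffs tonight")
programming_guess = clf.classify("learning Ruby")
```

With six training tweets this is a toy. A usable classifier would need hundreds or thousands of labeled tweets per topic.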

If you want to collect training data without having to manually label the tweets with topics, you could use hashtags to assign topic labels to the tweets. The hashtags could be identical with the topic labels, or you could write rules to map tweets with certain hashtags to the desired label. For example, tweets tagged either #NFL or #NBA could all be assigned a label of sports.
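The hashtag-to-label rules can be as simple as a lookup table. Here is a small sketch in plain Python; the mapping contents (beyond #NFL and #NBA mapping to sports) and the function names are assumptions made for the example.

```python
import re

# Rules mapping a lowercase hashtag to a topic label.
# Only the #NFL/#NBA -> sports rule comes from the text; #ruby is invented.
HASHTAG_TO_LABEL = {
    "nfl": "sports",
    "nba": "sports",
    "ruby": "programming",
}

def hashtags(tweet):
    """Return the hashtags in a tweet, lowercased and without the '#'."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", tweet)]

def label_from_hashtags(tweet, default=None):
    """Return the label of the first hashtag that has a rule, else `default`."""
    for tag in hashtags(tweet):
        if tag in HASHTAG_TO_LABEL:
            return HASHTAG_TO_LABEL[tag]
    return default

def build_training_set(tweets):
    """Keep only the tweets whose hashtags map to a label."""
    labeled = []
    for tweet in tweets:
        label = label_from_hashtags(tweet)
        if label is not None:
            labeled.append((tweet, label))
    return labeled

training_set = build_training_set([
    "What a game #NBA",
    "Touchdown! #nfl #sunday",
    "eating lunch",
])
```

Tweets without a recognized hashtag are simply dropped from the training set here. They can still be classified later, once the classifier has been trained on the tagged ones.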

Once you have the tweets labeled by topic, you can use any number of existing software packages to train a classifier that assigns labels to new tweets. A few available packages include:

Semantic Overlap

Finding tweets using their semantic overlap with your content avoids the need for a labeled training set. The simplest way to estimate the semantic overlap between your content and the tweets that you're scoring is to use a vector space model. To do this, represent your document and each tweet as a vector, with each dimension in the vector corresponding to a word. The value assigned to each vector position then represents how important that word is to the meaning of the document. One way to estimate this would be to simply use the number of times the word occurs in the document. However, you'll likely get better results by using something like TF-IDF, which up-weights rare terms and down-weights more common ones.

Once you've represented your content and the tweets as vectors, you can score the tweets by their semantic similarity to your content by taking the cosine similarity of the vector for your content and the vector for each tweet.
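For a feel of what such a package does under the hood, here is a minimal sketch of TF-IDF weighting plus cosine similarity in plain Python. It is not how Classifier4J or any other library implements it; the function names and the sample article text are invented, and the smoothed IDF formula, log((1 + N) / (1 + df)) + 1, is just one common variant chosen for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict: word -> weight) per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each word once per document
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF: rarer words get larger weights.
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_tweets(content, tweets, top_n=2):
    """Return the top_n (score, tweet) pairs, most similar to content first."""
    vectors = tfidf_vectors([content] + tweets)
    content_vec, tweet_vecs = vectors[0], vectors[1:]
    scored = [(cosine(content_vec, tv), tweet) for tv, tweet in zip(tweet_vecs, tweets)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# Invented article text; the tweets reuse the examples from this answer.
content = "The Lakers beat the Celtics in game seven of the NBA finals"
tweets = ["NBA finals rocked", "eating lunch", "Googlers now allowed to use Ruby!"]
ranked = rank_tweets(content, tweets, top_n=2)
```

Here the document frequencies are computed over the article plus the candidate tweets only. In practice you would probably compute them over a much larger sample of tweets so that the weights are more stable.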

There's no need to code any of this yourself. You can just use a package like Classifier4J, which includes a VectorClassifier class that scores document similarity using a vector space model.

Better Semantic Overlap

One problem you might run into with vector space models that use one term per dimension is that they don't do a good job of handling different words that mean roughly the same thing. For example, such a model would say that there is no similarity between "The small automobile" and "A little car".

There are more sophisticated modeling frameworks such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) that can be used to construct more abstract representations of the documents being compared to each other. Such models can be thought of as scoring documents not based on simple word overlap, but rather in terms of overlap in the underlying meaning of the words.
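The limitation is easy to reproduce. The following sketch in plain Python uses raw word counts as the vectors (the helper names are made up for the example). Since the two phrases share no words at all, their dot product, and therefore their cosine similarity, comes out as zero.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Represent a text as a word -> count mapping (one dimension per word)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * v.get(word, 0) for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

a = bag_of_words("The small automobile")
b = bag_of_words("A little car")

# The phrases mean nearly the same thing but have no word in common.
similarity = cosine(a, b)

# By contrast, phrases that share surface words score well above zero.
shared = cosine(a, bag_of_words("The small car"))
```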

In terms of software, the package Semantic Vectors provides a scalable LSA-like framework for document similarity. For LDA, you could use David Blei's implementation or the Stanford Topic Modeling Toolbox.

dmcer
Very thorough and complete answer. Well done.
Marty Pitt
