There are two straightforward ways to go about finding tweets relevant to your content.
The first would be to treat this as a supervised document classification task, whereby you would train a classifier to annotate tweets with a certain predetermined set of topic labels. You could then use the labels to select tweets that are appropriate for whatever content you'll be augmenting. If you don't like using a predetermined set of topics, another approach would be to simply score tweets according to their semantic overlap with your content. You could then display the top n tweets with the most semantic overlap.
Supervised Document Classification
Using supervised document classification would require that you have a training set of tweets labeled with the set of topics you'll be using. e.g.,
tweet: NBA finals rocked label: sports
tweet: Googlers now allowed to use Ruby! label: programming
tweet: eating lunch label: other
If you want to collect training data without having to manually label the tweets with topics, you could use hashtags to assign topic labels to the tweets. The hashtags could be identical with the topic labels, or you could write rules to map tweets with certain hashtags to the desired label. For example, tweets tagged either #NFL
or #NBA
could all be assigned a label of sports
.
Once you have the tweets labeled by topic, you can use any number of existing software packages to train a classifier that assigns labels to new tweets. A few available packages include:
Semantic Overlap
Finding tweets using their semantic overlap with your content avoids the need for a labeled training set. The simplest way to estimate the semantic overlap between your content and the tweets that you're scoring is to use a vector space model. To do this, represent your document and each tweet as a vector with each dimension in the vector corresponding to a word. The value assigned to each vector position then represents how important that word is to the meaning of document. One way to estimate this would be to simply use the number of times the word occurs in the document. However, you'll likely get better results by using something like TF/IDF, which up-weights rare terms and down-weights more common ones.
Once you've represented your content and the tweets as vectors, you can score the tweets by their semantic similarity to your content by taking the cosine similarity of the vector for your content and the vector for each tweet.
There's no need to code any of this yourself. You can just use a package like Classifier4J, which includes a VectorClassifier class that scores document similarity using a vector space model.
Better Semantic Overlap
One problem you might run into with vector space models that use one term per dimension is that they don't do a good job of handling different words that mean roughly the same thing. For example, such a model would say that there is no similarity between The small automobile
and A little car
.
There are more sophisticated modeling frameworks such as latent semantic analysis (LSA) and latent dirichlet allocation (LDA) that can be used to construct more abstract representations of the documents being compared to each other. Such models can be thought of as scoring documents not based on simple word overlap, but rather in terms of overlap in the underlying meaning of the words.
In terms of software, the package Semantic Vectors provides a scalable LSA-like framework for document similarity. For LDA, you could use David Blei's implementation or the Stanford Topic Modeling Toolbox.