A: 

I remember when MJ passed away, Twitter went back and manually fixed the topics to point to tweets about his death. It would be a lot to ask of a computer these days to do something like that fully automatically, although it can loosely be done.

Daniel A. White
So you think that the list I posted above has been created manually?
Potentially yes. It probably is a mix of both.
Daniel A. White
A source verifying this would be cool.
anderstornvig
+7  A: 

What you basically want is to find the similarity between two strings.

I think the Soundex algorithm is what you're looking for. It can be used for comparing strings based on how they sound. Or as wiki describes:

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

And:

Using this algorithm [EDIT: that is, "rating" words by a letter and three digits], both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" yields "A261".

There's also the Levenshtein distance.
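
Both happen to be single built-in function calls in PHP (which the other answers in this thread use), so you can get a quick feel for them:

// Soundex codes match when two names "sound alike"
echo soundex('Robert'); // "R163"
echo soundex('Rupert'); // "R163" - same code despite different spelling
echo soundex('Rubin');  // "R150"

// Levenshtein counts the single-character edits between two strings
echo levenshtein('Half Blood Prince', 'Half-Blood Prince'); // 1
echo levenshtein('Firefox 3', 'Firefox 2');                 // 1 - but a crucial edit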

Good luck.

anderstornvig
Thank you. I don't think Soundex or Levenshtein can help me here. The similarity between "Firefox 3" and "Firefox 2" is very high, but these two tags don't describe the same topic. Furthermore, some topics have only one spelling ("Monkey Islands") whereas others have multiple spellings ("Half Blood Prince"/"Half-Blood Prince").
+1  A: 

Assuming that the trending topics are generated computationally, the exact algorithm Twitter uses will be difficult to guess. It is most likely highly confidential, and possibly patented as well (as scary as patenting algorithms may sound).

I find it reasonable to believe, though, that they'd use some sort of natural language processing algorithm. Depending on the case, these are often computationally heavy to carry out and will only do what you want to some extent.

An obviously helpful read on the subject is Wikipedia's article on natural language processing: https://en.wikipedia.org/wiki/Natural_language_processing

Good luck.

anderstornvig
I don't think you meant Neuro-linguistic programming. That is an alternative approach to interpersonal communications and psychotherapy.
Jim McKeeth
Yeah sorry, you're right. Don't know why I wrote that. It's corrected now. Thanks.
anderstornvig
+3  A: 

There are many ways to do this. One straightforward article about Google-style "did you mean" checking is a good read for ideas on how to achieve this. It was written by Peter Norvig, Director of Research at Google:

http://norvig.com/spell-correct.html
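
Very roughly, the core idea of the article transplanted to PHP looks like this (the $wordCounts table is just a stand-in here; Norvig builds his from a large text corpus):

// Stand-in frequency table; in the article this is built from a big corpus.
$wordCounts = array('firefox' => 120, 'spelling' => 40, 'prince' => 25);

// All strings one edit away from $word (deletes, transposes, replaces, inserts).
function edits1($word) {
    $letters = 'abcdefghijklmnopqrstuvwxyz';
    $edits = array();
    $len = strlen($word);
    for ($i = 0; $i <= $len; $i++) {
        $left  = substr($word, 0, $i);
        $right = ($i < $len) ? substr($word, $i) : '';
        if ($i < $len) {
            $edits[] = $left . substr($right, 1);                             // delete
            if ($i < $len - 1) {
                $edits[] = $left . $right[1] . $right[0] . substr($right, 2); // transpose
            }
        }
        for ($j = 0; $j < 26; $j++) {
            if ($i < $len) {
                $edits[] = $left . $letters[$j] . substr($right, 1);          // replace
            }
            $edits[] = $left . $letters[$j] . $right;                         // insert
        }
    }
    return array_unique($edits);
}

// Pick the known candidate with the highest corpus frequency.
function correct($word, $wordCounts) {
    if (isset($wordCounts[$word])) return $word;
    $best = $word;
    $bestCount = 0;
    foreach (edits1($word) as $candidate) {
        if (isset($wordCounts[$candidate]) && $wordCounts[$candidate] > $bestCount) {
            $best = $candidate;
            $bestCount = $wordCounts[$candidate];
        }
    }
    return $best;
}

echo correct('firefoxx', $wordCounts); // "firefox"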

nategood
+2  A: 

"anderstornvig" mentioned the Levenshtein/edit distance, which is a great idea but isn't quite appropriate because certain permutations are more significant than other permutations. The problem seems to be that we're using a lot of domain-specific knowledge when we determine which differences are "significant" and which are "insignificant." For instance, we know that the hyphen in "Half-Blood Prince" is very important but the number in "Firefox 3" is very important.

For this reason, you might consider customizing a simple metric like Levenshtein: add parameters that let you specify which kinds of differences are important and which are unimportant.

In particular, Levenshtein counts the number of "edits" (that is, insertions, deletions, and substitutions) required to turn one string into another. Effectively, it weights every edit the same. You could write an implementation that weights some edits differently. For instance, changing a "-" to a " " should have a very low weight (indicating unimportance). Changing a "3" to a "2", when the number is alone, should have a very high weight (indicating high importance).
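
A rough sketch of what such a weighted variant could look like in PHP (the weight values are made-up illustrations, not tuned numbers):

// Edit distance where the cost of each edit depends on the characters involved.
function weightedLevenshtein($a, $b) {
    $m = strlen($a);
    $n = strlen($b);
    $d = array(array(0.0));
    for ($i = 1; $i <= $m; $i++) $d[$i][0] = $d[$i - 1][0] + insDelCost($a[$i - 1]);
    for ($j = 1; $j <= $n; $j++) $d[0][$j] = $d[0][$j - 1] + insDelCost($b[$j - 1]);
    for ($i = 1; $i <= $m; $i++) {
        for ($j = 1; $j <= $n; $j++) {
            $d[$i][$j] = min(
                $d[$i - 1][$j] + insDelCost($a[$i - 1]),                      // delete
                $d[$i][$j - 1] + insDelCost($b[$j - 1]),                      // insert
                $d[$i - 1][$j - 1] + substitutionCost($a[$i - 1], $b[$j - 1]) // substitute
            );
        }
    }
    return $d[$m][$n];
}

// Domain-specific weights: separator swaps are nearly free, digit swaps are expensive.
function substitutionCost($x, $y) {
    if ($x === $y) return 0.0;
    $separators = array('-', ' ', '_');
    if (in_array($x, $separators) && in_array($y, $separators)) {
        return 0.1; // "Half-Blood" vs "Half Blood": unimportant difference
    }
    if (ctype_digit($x) && ctype_digit($y)) {
        return 5.0; // "Firefox 3" vs "Firefox 2": a different topic
    }
    return 1.0;     // everything else behaves like plain Levenshtein
}

function insDelCost($c) {
    if (ctype_digit($c)) return 5.0;                    // adding/removing a digit changes the topic
    if (in_array($c, array('-', ' ', '_'))) return 0.1; // separators are nearly free
    return 1.0;
}

echo weightedLevenshtein('Half-Blood Prince', 'Half Blood Prince'); // 0.1
echo weightedLevenshtein('Firefox 3', 'Firefox 2');                 // 5

Note that insertions and deletions need matching weights too; otherwise the algorithm can route around an expensive substitution by deleting the character and re-inserting a different one cheaply.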

By parameterizing the calculation, you create an avenue for continually improving your algorithm. Build an initial configuration and run it on some test data. Find places where the metric is weak -- where it merges two terms you think should be separated, for instance -- and modify the parameterization until you're satisfied.

This way, you can train your algorithm using your domain-specific knowledge.

adrian
Thank you very much. Nice idea. I'll try this out.
+5  A: 

I'll try to answer my own question based on Broken Link's comment (thank you for this):


You've extracted phrases consisting of 1 to 3 words from your database of documents. Among these extracted phrases are the following:

  • Half Blood Prince
  • Half-Blood Prince
  • Halfblood Prince

For each phrase, you strip all special characters and blank spaces and make the string lowercase:

$phrase = 'Half Blood Prince';
$phrase = preg_replace('/[^a-z]/i', '', $phrase);
$phrase = strtolower($phrase);
// result is "halfbloodprince"

When you've done this, all 3 phrases (see above) have one spelling in common:

  • Half Blood Prince => halfbloodprince
  • Half-Blood Prince => halfbloodprince
  • Halfblood Prince => halfbloodprince

So "halfbloodprince" is the parent phrase. You insert both into your database, the normal phrase and the parent phrase.

To show a "Trending Topics Admin" like Twitter's you do the following:

// first select the top 10 parent phrases
$sql1 = "SELECT parentPhrase, COUNT(*) as cnt FROM phrases GROUP BY parentPhrase ORDER BY cnt DESC LIMIT 0, 10";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
    $parentPhrase = $sql3['parentPhrase'];
    $childPhrases = array(); // set up an array for the child phrases
    $fifthPart = round($sql3['cnt']*0.2);
    // now select all child phrases which make up 20% or more of the parent phrase's count
    $sql4 = "SELECT phrase FROM phrases WHERE parentPhrase = '".$sql3['parentPhrase']."' GROUP BY phrase HAVING COUNT(*) >= ".$fifthPart;
    $sql5 = mysql_query($sql4);
    while ($sql6 = mysql_fetch_assoc($sql5)) {
        $childPhrases[] = $sql6['phrase'];
    }
    // now you have the parent phrase which is on the left side of the arrow in $parentPhrase
    // and all child phrases which are on the right side of the arrow in $childPhrases
}

Is this what you thought of, Broken Link? Would this work?

+1  A: 

Most likely they have some automatic systems that suggest likely candidates for combining, and then a human makes the ultimate choice to combine them. There may be some they combine automatically.

  • Your suggestion of removing spaces and other punctuation is a good one. Most likely they automatically combine things that differ only in punctuation or whitespace.
  • Plural vs. singular: looking for these differences would be easy to automate, and would produce likely candidates for combining.
  • Common misspellings - there are databases of common misspellings. They might even rely on the Google API for spelling suggestions (I think they expose that).
  • Soundex (or similar) is a good one for finding spelling mistakes, but it would need to run after the above filters (remove spaces, punctuation and plurals) first, and then most likely a human would need to make the call on whether two phrases are the same. But if you could present a graphical representation showing clusters with the same or similar Soundex code, you would really make that part easy. You could automatically send a notification when a cluster starts to appear and trend (they really only care about the trending topics anyway, so even if a combined cluster is not trending, they can wait to examine it). A sketch of this clustering idea follows this list.
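
To make the Soundex clustering from the last bullet concrete, here is a small sketch reusing the normalization trick from the accepted approach above (the sample phrases are illustrative):

// Group phrases by the Soundex code of their normalized form; any cluster with
// more than one distinct spelling is a candidate for (human) review.
$phrases = array('Half Blood Prince', 'Half-Blood Prince', 'Hallf Blood Prince', 'Firefox');
$clusters = array();
foreach ($phrases as $phrase) {
    $normalized = strtolower(preg_replace('/[^a-z]/i', '', $phrase));
    $clusters[soundex($normalized)][] = $phrase;
}
foreach ($clusters as $code => $members) {
    $unique = array_unique($members);
    if (count($unique) > 1) {
        echo $code . ' => ' . implode(', ', $unique) . "\n";
    }
}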

Where you really need a human to step in is when there are common nicknames, like Michael Jackson, MJ, Michael, etc., or MacDonalds, McD, Micky-D's, etc. And with technical topics you have Visual Studio, VS2008, VS, etc., or StackOverflow, SO, etc. Then C#, C-Sharp, and C#.NET are all the same, but C and C++ are different.

So it would need to be a combination. It might rely on a database of known variations and combinations, based on previous analysis or other sources, but that database would be regularly maintained by an editor.
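
A hand-maintained alias table like that could start out as simple as this (entries purely illustrative):

// Hypothetical editor-maintained table mapping known variants to one canonical topic.
$aliases = array(
    'mj'      => 'michael jackson',
    'c-sharp' => 'c#',
    'csharp'  => 'c#',
    'vs2008'  => 'visual studio 2008',
    'so'      => 'stackoverflow',
);

function canonicalTopic($phrase, $aliases) {
    $key = strtolower(trim($phrase));
    return isset($aliases[$key]) ? $aliases[$key] : $key;
}

echo canonicalTopic('C-Sharp', $aliases); // "c#"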

Jim McKeeth
Thank you very much for this detailed answer. I think the database containing mappings like "C#=>C-Sharp" is a very good idea. The spelling suggestions are interesting, too.