views:

534

answers:

2

Hello!

With your great help here, I've already found out how to compute trending topics (standard score + floating average).

My next problem: I have the terms (consisting of 1-3 words) in my database, together with the times they were mentioned. But the trending topics are always 1-word terms, since one part of a term is ALWAYS mentioned more often than the complete term. Example: yesterday 3 news articles were about "Barack Obama" and today 148. So "Barack Obama" is rising, of course. But "Barack" is rising too, and so it becomes the trending topic instead.

How can I include the length of a term when I compute trending topics? I don't want to use another algorithm, I'm fully pleased with the algorithm above. Can I multiply the score of all two-word terms with 1.5 or so?

Detailed example: My top trends are: Microsoft, China, Hillary Clinton, Dallas Mavericks. "Hillary Clinton" and "Dallas Mavericks" are never ranked no. 1 or no. 2 because they're two-word terms. "Microsoft" and "China" are one-word terms, so they're always ranked higher. Is there any way to solve this problem?

I hope you can help me. Thanks in advance!

+1  A: 

Speaking of Obama: yes, you can. :)

Maybe you could test whether your high trends are contained in lower trends before outputting them. I would try something like this:

Example: you have

  1. Obama
  2. Air France
  3. Barack
  4. A330
  5. Barack Obama
  6. ...

If the list you want to output is not too long (say you are taking only the 100 best scores), you keep only the terms that are not contained in other terms, and perhaps add a 50% bonus to a term for each other term it contains. (You may have to take the first 150 values, then remove the redundancy, which may leave something like 110, then trim the last 10 to get back to your 100 values.)

"Barack Obama" contains both "Barack" and "Obama", so you could give it a bonus of 100%, and your list may become:

  1. Air France
  2. Barack Obama
  3. A330
  4. ...

I hope this doesn't change your algorithm too much; you can simply plug this step in at the end, just before outputting the list.

EDIT:

Or else, if you don't actually list the best scores but compute them one by one, you could split your term and compute a weighted sum of its components (e.g. the trend of "Barack Obama" becomes trend("Barack Obama") + 0.5*trend("Barack") + 0.5*trend("Obama")).
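A minimal sketch of that weighted sum, assuming the per-term trend scores are already available in a dictionary (the name `combined_trend`, the dictionary layout, and the numbers are hypothetical, and only single-word components are considered):

```python
def combined_trend(term, trend, weight=0.5):
    """Trend of a multi-word term plus a weighted share of each word's own trend."""
    own = trend.get(term, 0.0)
    words = term.split()
    if len(words) < 2:
        # One-word terms have no components, so they keep their own score.
        return own
    return own + weight * sum(trend.get(word, 0.0) for word in words)


# Made-up scores, purely for illustration.
trend = {"Barack Obama": 2.0, "Barack": 3.0, "Obama": 4.0}
score = combined_trend("Barack Obama", trend)  # 2.0 + 0.5*3.0 + 0.5*4.0
```

Because the boost depends on how strongly the components themselves are trending, the effect is not the same as multiplying every two-word term by a fixed factor.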

subtenante
Thank you very much! Very simple, but it works! :) Unfortunately, it covers only one case. The other case would be that my top trends are: Microsoft, China, Hillary Clinton, Dallas Mavericks. "Hillary Clinton" and "Dallas Mavericks" are never ranked no. 1 or no. 2 because they're two-word terms. "Microsoft" and "China" are one-word terms, so they're always ranked higher. Is there any way to solve this problem?
I'm not sure why that is. With the method I gave you (adding part of the trend of each sub-component), you can achieve the same kind of effect as a constant multiplier, except that the multiplier depends on the trends of the sub-components. Increasing the weights might do the trick, but be careful not to bend the data too much, or you might give credit to the wrong trends. (Silly example: someone whose name is MacDonald would benefit too much from a MacDonald's bankruptcy and end up with a higher trend than it, just because of his name.) And there's no space left to write here, so good luck.
subtenante
I don't have enough reputation to comment on AviD's answer, so I'll do it here: AviD's formula is (almost) correct. You subtract trend(Barack Obama) because it is counted twice: once in trend(Barack) and once in trend(Obama). (Counting all occurrences of "Barack" includes all occurrences of "Barack Obama", and the same goes for the count of "Obama", so "Barack Obama" is counted twice.) But to be completely exact, you must also subtract all occurrences of "Obama Barack", which is also counted twice. The problem is that AviD's formula also counts trends from "Ehud Barack" (misspelled for the sake of the example) and "Michelle Obama".
subtenante
A: 

Building on @subtenante's answer: the formula you're looking for follows from the fact that every mention of "Barack Obama" is also a mention of "Barack" and of "Obama"...
so simple math shows that it should be:

trend("Barack") + trend("Obama") - trend("Barack Obama")

... assuming, of course, that your partial terms only appear in the correct context, either individually or combined into the full term - i.e. "Barack" always refers to "Barack Obama" (and not e.g. "Ehud Barack").
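The formula is the inclusion-exclusion principle applied to mention counts. Here is a small sketch that checks it on a few invented headlines; it works on raw article counts (the answer doesn't say whether counts or scores are meant), and the helper `mentions` and the sample data are assumptions made for illustration:

```python
def mentions(articles, term):
    """Indices of the articles that contain `term` as a contiguous run of whole words."""
    term_words = term.lower().split()
    m = len(term_words)
    found = set()
    for index, text in enumerate(articles):
        words = text.lower().split()
        if any(words[j:j + m] == term_words for j in range(len(words) - m + 1)):
            found.add(index)
    return found


# Invented headlines, purely for illustration.
articles = [
    "Barack Obama visits Berlin",
    "Obama signs a bill",
    "President Barack Obama gives a speech",
    "Microsoft releases an update",
]

barack = mentions(articles, "Barack")
obama = mentions(articles, "Obama")
full = mentions(articles, "Barack Obama")

# "Barack" + "Obama" - "Barack Obama": the full term was counted twice, so remove it once.
combined = len(barack) + len(obama) - len(full)
```

Here `combined` equals the number of articles that mention "Barack" or "Obama" at all, but only because every article containing both words contains them as the adjacent pair "Barack Obama". An article about "Ehud Barack" and "Michelle Obama" would break that equality.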

AviD
Thank you! Why do you subtract the value of "Barack Obama"?
As @subtenante explained in his comment on his own post, I subtract "Barack Obama" because it's already included twice - it's counted once in trend("Barack") and a second time in trend("Obama"). However, @subtenante also made a good point about the rare "Obama Barack". As I also pointed out, the sub-terms might also appear in other trends besides the "Barack Obama" you expected.
AviD