ansaurus

Question

Trending topics: 1-word terms vs composed terms

Answer 1

+1 A:

Talking about Obama, Yes you can. :)

Maybe you could test whether your high trends are contained in lower trends before outputting them. I would try something like this:

Example: you have

Obama
Air France
Barack
A330
Barack Obama
...

If the list you want to output is not too long (say you keep only the 100 best scores), select only the terms that are not contained in others, maybe adding a 50% bonus to a term for each other term it contains. (You may have to take the first 150 values, remove the redundancy, which may leave something like 110, then trim the last 10 to get back to your 100 values.)

"Barack Obama" contains both "Barack" and "Obama", so you could give it a bonus of 100%, and your list may become:

Air France
Barack Obama
A330
...

Hope it doesn't change your algorithm too much; you can actually plug this step in at the very end, just before outputting the list.
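A minimal sketch of that post-processing step, in Python. The function name, the example scores, and the choice to compare terms by their sets of lowercase words (which ignores word order) are assumptions made for illustration, not part of the original algorithm:

```python
def merge_contained_trends(scores, limit, bonus=0.5):
    """Drop terms fully contained in a longer term and boost the longer term.

    scores: dict mapping term -> trend score (hypothetical values).
    bonus:  fraction of its own score a term gains per contained term.
    """
    words = {term: set(term.lower().split()) for term in scores}
    adjusted = dict(scores)
    absorbed = set()
    for term, term_words in words.items():
        for other, other_words in words.items():
            # `other` is a strict sub-term of `term`, e.g. "Obama" in "Barack Obama"
            if other != term and other_words < term_words:
                absorbed.add(other)
                adjusted[term] += bonus * scores[term]
    kept = [(t, s) for t, s in adjusted.items() if t not in absorbed]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Hypothetical scores, in the same order as the example list above
scores = {"Obama": 100, "Air France": 90, "Barack": 80, "A330": 70, "Barack Obama": 40}
top = merge_contained_trends(scores, limit=3)
print([term for term, _ in top])
```

With these made-up numbers, "Barack Obama" gets 50% twice (40 becomes 80) and ends up between "Air France" and "A330". To return 100 terms, pass in a larger candidate set (say the best 150) so that enough remain after the contained terms are dropped.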

EDIT :

Or else, if you don't list the best scores but compute the trends one by one, you could split each term and compute a weighted sum of its components (e.g. the trend of "Barack Obama" becomes trend("Barack Obama") + 0.5 * trend("Barack") + 0.5 * trend("Obama")).
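As a rough sketch, the weighted sum could look like this in Python. The function name, the `trend` dictionary and its values are hypothetical; only the 0.5 weight comes from the formula above:

```python
def combined_trend(term, trend, weight=0.5):
    """Trend of `term` plus a weighted share of each of its component words."""
    parts = term.split()
    score = trend.get(term, 0.0)
    if len(parts) > 1:
        # Single-word terms are left untouched; only composed terms get a share
        score += sum(weight * trend.get(part, 0.0) for part in parts)
    return score


# Hypothetical per-term trend values
trend = {"Barack Obama": 40, "Barack": 80, "Obama": 100}
print(combined_trend("Barack Obama", trend))  # 40 + 0.5*80 + 0.5*100
print(combined_trend("Obama", trend))
```

Raising `weight` pushes composed terms further up, at the risk of crediting them with activity that belongs to unrelated uses of their component words.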

subtenante 2009-06-06 07:34:17

Thank you very much! Very simple but it works! :) Unfortunately, it covers only one case. The other case would be that my top trends are: Microsoft, China, Hillary Clinton, Dallas Mavericks. I wanted to say that "Hillary Clinton" and "Dallas Mavericks" are never ranked no1 or no2 because they're two-word terms. "Microsoft" and "China" are one-word terms so they're always ranked better. Is there any possibility to solve this problem?

2009-06-06 12:40:00

I'm not sure why that is. With the method I gave you (adding part of the trend from the sub-components), you can achieve the same kind of effect as a constant multiplier, except that the multiplier depends on the trends of the sub-components. Maybe increasing the multipliers would do the trick, but you should be careful not to bend the data too much. You might give credit to the wrong trends (silly example: someone whose name is MacDonald would benefit too much from a MacDonald's bankruptcy and end up with a higher trend than the company itself, just because of his name). And no space left to talk here, so good luck.

subtenante 2009-06-06 18:54:26

Not enough reputation to comment on AviD's answer, so I do it here: AviD's formula is (almost) correct. You subtract trend(Barack Obama) because it is counted twice: once in trend(Barack) and once in trend(Obama) (counting all occurrences of "Barack" includes all occurrences of "Barack Obama"; same for the count of "Obama", so "Barack Obama" is counted twice). But to be completely exact, you must also subtract all occurrences of "Obama Barack", which is also counted twice. The problem is that AviD's formula also counts trends from "Ehud Barack" (misspelled for the sake of the example) and "Michelle Obama".

subtenante 2009-06-07 21:21:40

Answer 2

A:

Building on @subtenante's answer, the formula you should be looking for should be based on the fact that "Barack Obama" always contains "Barack" and also "Obama"...
so simple math would show that it should be:

trend("Barack") + trend("Obama") - trend("Barack Obama")

... assuming, of course, that your partial terms only appear in the correct context, either individually or combined into the full term - i.e. "Barack" will always be referring to "Barack Obama" (and not e.g. "Ehud Barack").
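To make the arithmetic concrete, here is a small Python sketch that counts headlines by substring match. The headlines and the `count_containing` helper are invented for the example, and it relies on the same assumption as above: every "Barack" or "Obama" refers to the same person.

```python
# Made-up headlines; each one counts at most once per phrase
headlines = [
    "Barack Obama visits Cairo",
    "Obama speaks on health care",
    "Barack addresses students",
    "Air France A330 search continues",
]


def count_containing(phrase, docs):
    """Number of documents that contain `phrase` (case-insensitive substring)."""
    phrase = phrase.lower()
    return sum(phrase in doc.lower() for doc in docs)


barack = count_containing("Barack", headlines)
obama = count_containing("Obama", headlines)
full = count_containing("Barack Obama", headlines)

# The full name is inside both partial counts, so subtract it once
total = barack + obama - full

# Cross-check: headlines mentioning either word, counted directly
direct = sum(("barack" in h.lower()) or ("obama" in h.lower()) for h in headlines)
print(total, direct)
```

Here the first headline is counted under both "Barack" and "Obama", so subtracting the full-name count once brings the total back in line with the direct count. The check would no longer hold if a headline contained both words without the exact phrase "Barack Obama".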

AviD 2009-06-07 13:04:41

Thank you! Why do you subtract the value of "Barack Obama"?

2009-06-07 18:18:46

As @subtenante explained in his comment on his own post, I subtract "Barack Obama" because it's already included twice - it's counted once for trend("Barack"), and a second time for trend("Obama"). However, @subtenante also made a good point about the rare "Obama Barack". As I also pointed out, the sub-terms might appear in other trends too, besides the "Barack Obama" you expected.

AviD 2009-06-07 21:34:52
