I need guidance on how to compute the GoogleShare of several terms.

For example, take the following base terms:

  • "Tom Cruise" = 12,000,000 pages
  • "John Travolta" = 4,900,000 pages

Now if we add a second term:

  • "Tom Cruise" + "Scientology" = 784,000 pages
  • "John Travolta" + "Scientology" = 331,000 pages

So the GoogleShare for Tom Cruise and Scientology is (784000 * 100 / 12000000) = 6.53%, while the GoogleShare for John Travolta and Scientology is (331000 * 100 / 4900000) = 6.76%.


Now if we add a third term to our query:

  • "Tom Cruise" + "Scientology" + "StackOverflow" = 100 pages
  • "John Travolta" + "Scientology" + "StackOverflow" = 181 pages

How should I compute the GoogleShare percentage now?

// Tom Cruise
100 * 100 / 784000 = 0.01% // StackOverflow / Scientology
// or...
100 * 100 / 12000000 = 0.00083% // StackOverflow / Tom Cruise

// John Travolta
181 * 100 / 331000 = 0.05% // StackOverflow / Scientology
// or...
181 * 100 / 4900000 = 0.00369% // StackOverflow / John Travolta

Either way, John Travolta seems to be roughly 4 times more Scientologist than Tom Cruise inside the SO community.

What is the correct way to compute the GoogleShare of N terms?

+1  A: 

It depends. First, let's lay a little groundwork on what GoogleShare is.

Consider your searches

"Tom Cruise" + "Scientology"
"John Travolta" + "Scientology"

What you're computing when you compute the GoogleShare here is the percentage of search results for "Scientology" that also contain "Tom Cruise" versus the percentage of search results for "Scientology" that also contain "John Travolta". So the way to compute this is as follows:

Google search for "Scientology": 4,730,000 hits

Compare to:

Google search for "Tom Cruise" and "Scientology": 825,000 hits
Google search for "John Travolta" and "Scientology": 340,000 hits

Therefore, the "Tom Cruise" GoogleShare of "Scientology" is 825,000 / 4,730,000 = 17.44%, and the "John Travolta" GoogleShare of "Scientology" is 340,000 / 4,730,000 = 7.19%. We say that "Tom Cruise" is more connected to "Scientology" than "John Travolta" is. Thus I note that your initial calculations of the GoogleShare of "Tom Cruise" versus the GoogleShare of "John Travolta" in "Scientology" were incorrect: they divide by the hit counts for "Tom Cruise" and "John Travolta" rather than by the hit count for "Scientology". The key is deciding what your base search is (here, "Scientology") and which terms you want to measure the share of within that space (here, "Tom Cruise" versus "John Travolta").
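As a minimal sketch of the arithmetic above (the function name `google_share` and its parameters are my own, not part of any Google API), the share is just the combined hit count divided by the base hit count:

```python
def google_share(combined_hits: int, base_hits: int) -> float:
    """Percentage of the base query's results that also match the extra term.

    Hypothetical helper for illustration; the hit counts are typed in by hand.
    """
    if base_hits <= 0:
        raise ValueError("base_hits must be positive")
    return 100.0 * combined_hits / base_hits


# Hit counts quoted in this answer, with "Scientology" as the base search.
scientology_hits = 4_730_000
cruise_share = google_share(825_000, scientology_hits)
travolta_share = google_share(340_000, scientology_hits)

print(f"Tom Cruise share of Scientology: {cruise_share:.2f}%")
print(f"John Travolta share of Scientology: {travolta_share:.2f}%")
```

Swapping the denominator for the "Tom Cruise" hit count would answer a different question: how much of the "Tom Cruise" space mentions "Scientology".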

Now consider the searches

"Scientology" + "Tom Cruise" + keyword

and

"Scientology" + "John Travolta" + keyword.

There are two ways to view this. Are you trying to measure the share of "Tom Cruise" and "John Travolta" in the space of ("Scientology" + keyword) or are you trying to measure the share of "Tom Cruise" + keyword in the space of "Scientology"? These are different.

Google search for "Scientology" + "StackOverflow": 34,300

Google search for "Tom Cruise" and "Scientology" and "StackOverflow": 1,360
Google search for "John Travolta" and "Scientology" and "StackOverflow": 1,660

If you want the share of "Tom Cruise" and "John Travolta" in the space of ("Scientology" + "StackOverflow") you'd compute:

"Tom Cruise": 1360 / 34300 = 3.97%
"John Travolta": 1660 / 34300 = 4.84%

If you want the share of "Tom Cruise" + "StackOverflow" and "John Travolta" + "StackOverflow" in the space of "Scientology" you'd compute:

"Tom Cruise" + "StackOverflow": 1360 / 4730000 = 0.029%
"John Travolta" + "StackOverflow": 1660 / 4730000 = 0.035%

You see, it all depends on what your base search is and which terms you are trying to find the share of within that base. In the first version our base search is "Scientology" + "StackOverflow" and we are seeing what share "Tom Cruise" and "John Travolta" have of this space. In the second version our base search is "Scientology" and we are seeing what share "Tom Cruise" + "StackOverflow" and "John Travolta" + "StackOverflow" have of this space.
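To make the two choices of base concrete, here is a small sketch using the hit counts quoted above. The dictionary keys and the `share` helper are illustrative names I made up; nothing here queries Google.

```python
# Hit counts copied from this answer; the keys are arbitrary labels.
hits = {
    "scientology": 4_730_000,
    "scientology+stackoverflow": 34_300,
    "cruise+scientology+stackoverflow": 1_360,
    "travolta+scientology+stackoverflow": 1_660,
}


def share(narrow: str, base: str) -> float:
    """Percentage of the base query's hits that the narrower query accounts for."""
    return 100.0 * hits[narrow] / hits[base]


for person in ("cruise", "travolta"):
    narrow = f"{person}+scientology+stackoverflow"
    # Version 1: base search is "Scientology" + "StackOverflow".
    in_pair_space = share(narrow, "scientology+stackoverflow")
    # Version 2: base search is "Scientology" alone.
    in_single_space = share(narrow, "scientology")
    print(f"{person}: {in_pair_space:.2f}% vs {in_single_space:.3f}%")
```

The numerator is the same in both versions; only the denominator, that is, the base search, changes.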

Jason
A: 

It depends on what you're after. The first figure measures how often Stack Overflow is mentioned as a proportion of all results showing both Tom Cruise and Scientology. The second measures how often Stack Overflow and Scientology are both mentioned as a proportion of all results showing Tom Cruise.

CodeByMoonlight
+1  A: 

I don't see the difference between N terms and, say, 2 terms. Whenever you have more than 1 term, you are implicitly taking a GoogleShare with respect to some initial search term. For any N >= 2, there are multiple GoogleShares, one for each subset of the narrow query that you choose as the initial search.

You state that the "GoogleShare for Tom Cruise and Scientology" is 6.53%, but this is somewhat misleading since the term "and" tends to imply some kind of symmetry, where you could switch "Tom Cruise" and "Scientology" without changing the meaning. This is in fact not the case, since your initial term was "Tom Cruise" alone.

Perhaps a better description of the score you calculated is to say "Tom Cruise has a 'Scientology' GoogleShare of 6.53%." This removes all ambiguity, since now we know that "Tom Cruise" comes along with the term "Scientology" 6.53% of the time instead of the reverse (i.e. 6.53% of Scientology results mention Tom Cruise).

When you think of it this way, the corresponding generalization to N terms falls right out. Just put whatever initial terms you like in front of "has/have" and whatever additional narrowing terms you like after. With the numbers you gave, you could say that "John Travolta's Scientology references have a Stack Overflow GoogleShare of 0.05%" or that "John Travolta has a Scientology + Stack Overflow GoogleShare of 0.00369%". Pick whichever way is more informative in context.
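One way to sketch the "initial terms have a narrowing-terms GoogleShare" phrasing for any N is below. It assumes you have already collected hit counts for each set of terms; the `hits` table and the `google_share` signature are hypothetical, and the numbers are the ones from the question.

```python
from typing import Dict, FrozenSet, Iterable

HitCounts = Dict[FrozenSet[str], int]


def google_share(hits: HitCounts,
                 initial: Iterable[str],
                 narrowing: Iterable[str]) -> float:
    """Percentage of results for the initial terms that also match the narrowing terms."""
    base = frozenset(initial)
    narrowed = base | frozenset(narrowing)
    return 100.0 * hits[narrowed] / hits[base]


# Hand-entered hit counts from the question, keyed by the set of query terms.
hits: HitCounts = {
    frozenset({"John Travolta"}): 4_900_000,
    frozenset({"John Travolta", "Scientology"}): 331_000,
    frozenset({"John Travolta", "Scientology", "StackOverflow"}): 181,
}

# "John Travolta's Scientology references have a Stack Overflow GoogleShare of ..."
refs_share = google_share(hits, ["John Travolta", "Scientology"], ["StackOverflow"])
# "John Travolta has a Scientology + Stack Overflow GoogleShare of ..."
overall_share = google_share(hits, ["John Travolta"], ["Scientology", "StackOverflow"])

print(f"{refs_share:.2f}%  {overall_share:.5f}%")
```

Keying the table by a set of terms reflects that the order of terms in the narrow query does not matter; only the split between initial and narrowing terms does.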

Clueless