On the Stack Overflow podcast this week, Jeff mentioned that in 2004 he wrote a script that queried Google with 110,000 English words and built a database of the number of hits for each word. Stack Overflow uses this database, for example, to build the "Related" list on the right-hand side of each question page.

Since creating one of these today with a similar script would be difficult (as Joel mentioned, "at 30,000 words you get a knock at your door"), does anyone know of a more up-to-date, free database of Google word frequencies? Ideally it would cover IT terms whose frequencies have surely changed since then, such as jquery, ruby, and azure.

+3  A: 

A quick Google search(!) turns up a few hits. This link looks promising:

But it's not targeted at IT words.

Mitch Wheat
A: 

You can split the word list among your friends/colleagues and use sufficiently long delays between requests so that nobody exceeds 50,000 requests per day per IP, then merge the results. I'm not sure about the legality of this approach, but the probability of Google people "knocking at your door" with this method is pretty low.
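
To put rough numbers on this, here is a minimal sketch of the arithmetic only; it does not send any requests. The 50,000-per-day figure is the one quoted in this thread and is not verified, and `split_words` and `min_delay_seconds` are hypothetical helper names made up for illustration.

```python
def split_words(words, n_people):
    """Deal the words round-robin into n_people roughly equal lists."""
    return [words[i::n_people] for i in range(n_people)]


def min_delay_seconds(daily_limit):
    """Average gap between requests that keeps one IP within daily_limit per 24 hours."""
    return 86400 / daily_limit


# Placeholder word list the same size as the one mentioned in the question.
words = [f"word{i}" for i in range(110_000)]

# Assumption: four people share the work.
chunks = split_words(words, 4)

# Assumption: the per-IP daily limit quoted in this thread.
delay = min_delay_seconds(50_000)

print(len(chunks), len(chunks[0]), delay)
```

With four people, each list holds 27,500 words, so each person stays under the quoted limit even in a single day as long as requests are spaced by at least the computed delay.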

NOTE: edited according to data provided by Skuta

bgbg
+1  A: 

According to Google, you may send 50,000 queries per day per IP. I don't really think it is illegal to split them among your friends.

I had a similar problem with the queries-per-day-per-IP limit, but we solved it with a totally different approach.

Skuta
Do you mind sharing this "different" approach?
bgbg
Um...Have you ever heard of the term "DDOS"? I don't think Google would be happy if they found out you were doing that.
lacqui
It's not DDOS, my dear.
Skuta
+1  A: 

It may be late to answer this, but I can propose a different way. Instead of getting the "number of hits" from Google, compute an approximation of it yourself. Get a big collection of text pages (a corpus) and count the occurrences of each word in it. I have done this with Wikipedia: there is a dump of all wiki pages, so you just need to write a parser to extract the text and count the words. The result is a list of more than 110K words (at least 2M-3M). If you really need the numbers from Google search results, you can query Google for a sample of words and then normalize the computed values to match the Google values. I hope this helps.
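
A minimal sketch of the corpus-counting idea described above, under stated assumptions: the two strings stand in for text already extracted from a Wikipedia dump, the hit counts in `sample_hits` are made up (not real Google data), and the normalization is one simple choice (a single scale factor fitted on the sampled words), not necessarily the one I used. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters; real text needs a better tokenizer.
WORD_RE = re.compile(r"[a-z]+")


def count_words(texts):
    """Count lowercase alphabetic tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts


def scale_to_reference(counts, reference_hits):
    """Scale corpus counts so that the sampled words sum to their reference hits."""
    shared = [w for w in reference_hits if counts[w] > 0]
    corpus_total = sum(counts[w] for w in shared)
    if corpus_total == 0:
        raise ValueError("no sampled word occurs in the corpus")
    factor = sum(reference_hits[w] for w in shared) / corpus_total
    return {w: c * factor for w, c in counts.items()}


# Toy corpus standing in for pages parsed out of the dump.
pages = [
    "Ruby is a language. Ruby on Rails uses Ruby.",
    "jQuery is a JavaScript library.",
]
counts = count_words(pages)

# Made-up hit counts for a small sample of words.
sample_hits = {"ruby": 3000, "is": 2000}
estimates = scale_to_reference(counts, sample_hits)

print(counts.most_common(3))
print(estimates["jquery"])
```

A single scale factor keeps the relative ordering of words from the corpus; if the sampled ratios vary a lot between rare and common words, fitting on log counts may match better.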

Ross