I am effectively trying to solve the same problem as this question:

http://stackoverflow.com/questions/610399/finding-related-words-specifically-physical-objects-to-a-specific-word

minus the requirement that the words represent physical objects. The answers and the edited question seem to indicate that a good start is building a list of n-gram frequencies using Wikipedia text as a corpus. Before I start downloading the mammoth Wikipedia dump, does anyone know if such a list already exists?

PS: if the original poster of the previous question sees this, I would love to know how you went about solving the problem, as your results seem excellent :-)

+1  A: 

Google has a publicly available terabyte n-gram database (up to 5-grams).
You can order it on 6 DVDs or find a torrent that hosts it.
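
If you do get hold of such a dataset, reading the counts is usually the easy part. The sketch below assumes each line holds the space-separated tokens of one n-gram, then a tab, then the count; that layout is an assumption and should be checked against the documentation shipped with the data. The names `parse_ngram_line` and `load_counts` are made up for illustration.

```python
def parse_ngram_line(line):
    """Split one 'tok1 tok2 ...<TAB>count' line into (tokens, count).

    The line layout is an assumption, not a confirmed spec of the dataset.
    """
    ngram, _, count = line.rstrip("\n").rpartition("\t")
    return tuple(ngram.split(" ")), int(count)


def load_counts(lines):
    """Build a dict mapping n-gram tuples to counts from an iterable of lines."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        tokens, count = parse_ngram_line(line)
        counts[tokens] = count
    return counts


# Hypothetical sample lines, not real data from the dataset.
sample_lines = ["new york city\t1234\n", "new york times\t987\n"]
table = load_counts(sample_lines)
```

Since `load_counts` accepts any iterable of lines, you can pass an open file object and avoid reading a whole file into memory at once.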

Shay Erlichmen
Yes, I've considered that dataset - it's even more terrifyingly large than the Wikipedia dumps!
mojones
It's not available for commercial use
Joel
+3  A: 

I have been working with Wikipedia as a corpus over the last few days. Until I organize it properly, you can find links to what I have done so far on my user page: http://en.wikipedia.org/wiki/User:Tresoldi (there are models for languages such as English, Italian, Basque, Swahili and Quechua). You can either extract the n-grams yourself from the corpora or download the language models (in iARPA format), which contain the most common n-grams up to order 5. If you want to do everything yourself, a good and simple starting point is the work of Petter Haugereid (here).
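
For the "extract the n-grams yourself" route, a minimal sketch of counting n-grams up to a given order from plain text could look like the following. It is not the code used to build the models above; the regex tokenizer and the names `tokenize` and `count_ngrams` are assumptions for illustration, and a real Wikipedia dump would first need its markup stripped.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes.

    A deliberately naive tokenizer, assumed here only for illustration.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def count_ngrams(text, max_order=5):
    """Return {n: Counter of n-gram tuples} for n = 1..max_order."""
    tokens = tokenize(text)
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


sample = "the cat sat on the mat and the cat slept"
counts = count_ngrams(sample, max_order=3)
top_bigrams = counts[2].most_common(1)
```

For a corpus the size of Wikipedia you would process it in chunks and merge the counters (`Counter` supports `update`), pruning rare n-grams as you go so the tables fit in memory.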

Giacomo