I'm working on a "Twitter filter" - more to learn Ruby on Rails than anything else. The idea is that I use a semantic ontology to look up a user's interests. So if a user says they're interested in "sports", that means flag any tweets that discuss "sports", "golf", "football" and so on.

I'd like to be able to expand it to any hierarchy of topics, though. So if you're interested in Europe, flag all the countries in Europe.

Naturally this is rather complex, so maybe we'd limit it to one or two "levels" of lookup...

How could I do this efficiently? I'm pretty familiar with Java, C and Ruby, and have worked a lot with MySQL.

+2  A: 

I'd look into Doug Lenat's Cyc. It's done and open.

duffymo
Isn't this a commercial solution? Seems inappropriate, as this looks like a "learn the language" project, not something with a budget.
Mike
There's an open source version - see OpenCyc. If you look at the scope of Cyc, and the length of time that it's been in development, my point is that it's a huge task to undertake. I think it's naive to think a few MySQL tables will suffice.
duffymo
A: 

I'm not sure if it will help you, but Google has something called Google Sets. You can look at it here: http://labs.google.com/sets

klew
A: 

Before you think about programming languages and technology, think about this: what kind of data structure is a "semantic ontology"?

To me that sounds like some kind of a directed graph.

Knowing that, you'll soon find that it's quite easy to implement such a structure in whatever language and technology you want, and that a lot of languages already have some kind of graph library (e.g. RGL for Ruby).
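
As a rough sketch of the directed-graph idea in plain Ruby (no graph library), here is one way a depth-limited lookup could work. Everything in it is an assumption made for illustration: the `ONTOLOGY` hash and its topics, the helper names `expand_topic` and `flag_tweet?`, and the naive substring matching. Edges point from a topic to its direct subtopics, and the expansion stops after `max_depth` levels, which matches the "one or two levels" limit from the question.

```ruby
require "set"

# Hypothetical ontology: each topic maps to its direct subtopics
# (directed edges, parent -> child). Real data would come from elsewhere.
ONTOLOGY = {
  "sports"   => ["golf", "football"],
  "football" => ["premier league"],
  "europe"   => ["spain", "france"],
  "spain"    => ["madrid"]
}.freeze

# Breadth-first expansion of a topic, following at most max_depth edges.
# Returns a Set containing the topic itself plus every subtopic reached.
def expand_topic(graph, topic, max_depth)
  seen = Set.new([topic])
  frontier = [topic]
  max_depth.times do
    # Collect the children of the current level, skipping topics already seen
    # (this also protects against cycles in the graph).
    frontier = frontier.flat_map { |t| graph.fetch(t, []) }
                       .reject { |t| seen.include?(t) }
                       .uniq
    break if frontier.empty?
    seen.merge(frontier)
  end
  seen
end

# Flags a tweet if it mentions the interest or any subtopic within max_depth
# levels. Matching is a plain case-insensitive substring test (an assumption).
def flag_tweet?(graph, tweet, interest, max_depth = 2)
  text = tweet.downcase
  expand_topic(graph, interest, max_depth).any? { |term| text.include?(term) }
end
```

With this sketch, `flag_tweet?(ONTOLOGY, "Visiting Madrid next week", "europe", 2)` follows europe -> spain -> madrid and flags the tweet, while the same call with a depth of 1 stops at the country level and does not. Substring matching will also hit unrelated words that happen to contain a topic name, so a real filter would probably want to tokenize the tweet first.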

To me, the real problem isn't how to implement such a data structure efficiently, but how to get the semantic information you need out of Twitter to build it (e.g. who tells your application that Europe isn't a part of Spain, but that Spain is a part of Europe?).

Anyway, have fun implementing it, sounds like a cool project! :-)

Javier
A: 

I'm not sure what your requirements are. But it seems that either Singular Value Decomposition (SVD) or Support Vector Machines (SVM) will work for you.

Redbeard