Hi folks,

I think this is an interesting question, at least for me.


I have a list of words, let's say:

photo, free, search, image, css3, css, tutorials, webdesign, tutorial, google, china, censorship, politics, internet

and I have a list of contexts:

  • Programming
  • World news
  • Technology
  • Web Design

I need to try and match words with the appropriate context/contexts if possible.

Maybe discovering word relationships in some way.


Any ideas?

Help would be much appreciated!

+3  A: 

This sounds like it's more of a categorization/ontology problem than NLP. Try WordNet for a standard ontology.

I don't see any real NLP in your stated problem, but if you do need some semantic analysis or a parser try NLTK.

adam
@Adam: thank you very much! This is really useful! :)
RadiantHex
+2  A: 

Where do these words come from? Do they come from real texts? If so, this is a classic data mining problem. What you need to do is convert your set of documents into a matrix where each row represents a document and each column represents a word that occurs in the documents.

For example if you have two documents like this:

D1: Need to find meaning.
D2: Need to separate Apples from oranges

your matrix will look like this:

      Need  to  find  meaning  Apples  Oranges  Separate  From
D1:   1     1   1     1        0       0        0         0
D2:   1     1   0     0        1       1        1         1

This is called a term-by-document matrix (laid out with documents as rows, as here, it is also known as a document-term matrix).
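
Here is a minimal sketch of how the matrix for D1 and D2 could be built, using only the Python standard library. The function name `build_document_term_matrix` and the simple lowercase tokenizer are my own assumptions for illustration, not part of any particular library. Columns come out in first-seen order, so they are ordered slightly differently from the table above, but the counts are the same.

```python
import re

def build_document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Assumption: a crude tokenizer that lowercases and keeps letters/digits only.
    tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in documents]

    # Collect the vocabulary in first-seen order.
    vocabulary = []
    seen = set()
    for doc_tokens in tokenized:
        for token in doc_tokens:
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)

    # One row per document, one column per vocabulary term.
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for doc_tokens in tokenized:
        row = [0] * len(vocabulary)
        for token in doc_tokens:
            row[column[token]] += 1
        matrix.append(row)
    return vocabulary, matrix

documents = ["Need to find meaning.", "Need to separate Apples from oranges"]
vocabulary, matrix = build_document_term_matrix(documents)
for label, row in zip(["D1", "D2"], matrix):
    print(label, row)
```

On real text you would probably also want to drop stop words like "to" and "from", since they carry little topical information.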

Having collected these statistics, you can use algorithms like K-Means to group similar documents together. Since you already know how many concepts you have, your task should be somewhat easier. K-Means can be slow on large, high-dimensional matrices, so you can try to speed it up by first reducing the dimensionality with techniques such as SVD.
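
To make the clustering step concrete, here is a rough pure-Python sketch of K-Means over document rows. The function names, the toy count rows, and the fixed iteration count are assumptions for illustration. It leaves out the SVD step entirely, and for real data a library implementation would be a better choice.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(rows):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def k_means(rows, k, iterations=20, seed=0):
    """Return a cluster index (0..k-1) for each row."""
    rng = random.Random(seed)
    # Assumption: initialise centroids from k randomly chosen rows.
    centroids = [list(row) for row in rng.sample(rows, k)]
    assignments = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignments) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignments

# Hypothetical word-count rows: the first two documents share one set of
# terms, the last two share another.
rows = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
clusters = k_means(rows, k=2)
print(clusters)
```

The cluster numbers themselves are arbitrary. You would still need to look at the dominant words in each cluster to decide which of your contexts it corresponds to.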

Vlad
+2  A: 

I just found this a couple days ago: ConceptNet

It's a commonsense ontology, so it might not be as specific as you would like, but it has a Python API and you can download their entire database (currently around 1GB decompressed). Just keep in mind their licensing restrictions.

If you read the papers that were published by the team that developed it, you may get some ideas on how to relate your words to concepts/contexts.

tgray
@tgray: thank you very much! I'm reading through the docs
RadiantHex
+1  A: 

The answer to your question obviously depends on the target taxonomy you are trying to map your terms into. Once you have decided on this, you need to figure out how fine-grained the concepts should be. WordNet, as has been suggested in other responses, will give you synsets, i.e. sets of terms which are more or less synonymous. You will have to map these to concepts like 'Web Design' or 'World News' by some other mechanism, since such concepts are not encoded in WordNet. If you're aiming at a very broad semantic categorization, you could use WordNet's higher-level concept nodes, which differentiate, e.g. (going up the hierarchy), human from animal, animates from plants, substances from solids, concrete from abstract things, etc.

Another kind-of-taxonomy which may be quite useful to you is the Wikipedia category system. This is not just a spontaneous idea of mine; there has been a lot of work on deriving real ontologies from Wikipedia categories. Take a look at the Java Wikipedia Library: the idea would be to find a Wikipedia article for the term in question (e.g. 'css3'), extract the categories this article belongs to, and pick the best ones with respect to some criterion (e.g. 'programming', 'technology', and 'web-development'). Depending on what you're trying to do, this last step (choosing the best of several given categories) may or may not be difficult.
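
As a very rough sketch of that last step, here is one possible criterion: score each target context by how many of the article's categories share a word with it. It is written in Python rather than against the Java Wikipedia Library, the function names are my own, and the category labels are made up for illustration. In practice they would come from whatever tool you use to look up the article.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric words in a label."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_contexts(article_categories, contexts):
    """Return the contexts that share a word with at least one category,
    best match first."""
    scores = {}
    for context in contexts:
        context_tokens = tokens(context)
        scores[context] = sum(
            1 for category in article_categories
            if context_tokens & tokens(category)
        )
    matched = [context for context in contexts if scores[context] > 0]
    return sorted(matched, key=lambda context: scores[context], reverse=True)

contexts = ["Programming", "World news", "Technology", "Web Design"]

# Hypothetical category labels for the article found for 'css3';
# these are NOT taken from Wikipedia.
example_categories = ["Web design", "Web development", "Stylesheet languages"]

ranked = rank_contexts(example_categories, contexts)
print(ranked)
```

Plain word overlap is crude: it will miss 'Programming' here even though a human would probably accept it, which is exactly why this step may or may not be difficult depending on your needs.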

See here for a list of other ontologies / knowledge bases you could use.

ferdystschenko
@ferdy Oh my god!! I had the idea of using the Google API to search for related Wikipedia articles last night, as keywords like 'css3' might cause problems. I think I might go with your suggestion, thanks for the very informative answer!
RadiantHex
Glad I could help :)
ferdystschenko