ansaurus

Question

Defining the context of a word - Python

Answer 1

+3 A:

This sounds like it's more of a categorization/ontology problem than NLP. Try WordNet for a standard ontology.

I don't see any real NLP in your stated problem, but if you do need some semantic analysis or a parser try NLTK.

adam 2010-03-23 14:45:17

@Adam: thank you very much! This is really useful! :)

RadiantHex 2010-03-23 19:14:37

Answer 2

+2 A:

Where do these words come from? Do they come from real texts? If so, this is a classic data mining problem. What you need to do is convert your set of documents into a matrix where each row represents a document and each column represents a word that occurs in the documents.

For example if you have two documents like this:

D1: Need to find meaning.
D2: Need to separate Apples from oranges

your matrix will look like this:

      Need  to  find  meaning  Apples  Oranges  Separate  From
D1:   1     1   1     1        0       0        0         0
D2:   1     1   0     0        1       1        1         1

This is called a term-by-document matrix (also known as a document-term matrix).
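A minimal sketch of how such a matrix could be built in plain Python. The function name `build_document_term_matrix` and the simple regex tokenizer are assumptions made for illustration; the columns come out in order of first appearance, so they are ordered slightly differently from the table above.

```python
import re


def build_document_term_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    # Naive tokenizer (an assumption): lowercase, keep runs of letters only.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]

    # Collect the vocabulary in order of first appearance so columns are stable.
    vocabulary = []
    seen = set()
    for tokens in tokenized:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)

    # Binary entries: 1 if the term occurs in the document, 0 otherwise.
    matrix = []
    for tokens in tokenized:
        present = set(tokens)
        matrix.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, matrix


docs = ["Need to find meaning.", "Need to separate Apples from oranges"]
vocabulary, matrix = build_document_term_matrix(docs)

print(vocabulary)
for label, row in zip(["D1", "D2"], matrix):
    print(label, row)
```

Using counts instead of 0/1 flags would only need `tokens.count(term)` in place of the membership test.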

Having collected these statistics, you can use algorithms like K-Means to group similar documents together. Since you already know how many concepts you have, your task should be somewhat easier. K-Means can be slow on large, high-dimensional matrices like this one, so you can try to speed it up by reducing the dimensionality first with techniques such as SVD.
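As a rough illustration of the grouping step, here is a small pure-Python K-Means (Lloyd's algorithm) run on a toy matrix. The rows for D3 and D4 are made up for the example, the column set is shortened, and seeding the centroids with the first two rows is an assumption to keep the run deterministic; the SVD step is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, initial_centroids, max_iterations=100):
    """Return one cluster index per input vector."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach every vector to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(v, centroids[i]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = mean_vector(members)
    return assignments


# Columns: need, find, meaning, apples, oranges, separate
rows = [
    [1, 1, 1, 0, 0, 0],  # D1
    [1, 0, 0, 1, 1, 1],  # D2
    [0, 1, 1, 0, 0, 0],  # D3 (hypothetical document)
    [0, 0, 0, 1, 1, 0],  # D4 (hypothetical document)
]
clusters = k_means(rows, k=2, initial_centroids=[rows[0], rows[1]])
print(clusters)
```

In practice a library implementation with random restarts would be a safer choice than hand-picked starting centroids.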

Vlad 2010-03-23 14:45:26

Answer 3

+2 A:

I just found this a couple days ago: ConceptNet

It's a commonsense ontology, so it might not be as specific as you would like, but it has a python API and you can download their entire database (currently around 1GB decompressed). Just keep in mind their licensing restrictions.

If you read the papers that were published by the team that developed it, you may get some ideas on how to relate your words to concepts/contexts.

tgray 2010-03-23 15:21:59

@tgray: thank you very much! I'm reading through the docs

RadiantHex 2010-03-23 20:32:04

Answer 4

+1 A:

The answer to your question obviously depends on the target taxonomy you are trying to map your terms into. Once you have decided on this, you need to figure out how fine-grained the concepts should be. WordNet, as suggested in other responses, will give you synsets, i.e. sets of terms which are more or less synonymous. You will still have to map these to concepts like 'Web Design' or 'World News' by some other mechanism, since these are not encoded in WordNet. If you're aiming at a very broad semantic categorization, you could use WordNet's higher-level concept nodes which differentiate, e.g. (going up the hierarchy) human from animal, animates from plants, substances from solids, concrete from abstract things, etc.

Another kind-of-taxonomy which may be quite useful to you is the Wikipedia category system. This is not just a spontaneous idea of mine: there has been a lot of work on deriving real ontologies from Wikipedia categories. Take a look at the Java Wikipedia Library. The idea would be to find a Wikipedia article for the term in question (e.g. 'css3'), extract the categories this article belongs to, and pick the best ones with respect to some criterion (e.g. 'programming', 'technology', and 'web-development'). Depending on what you're trying to do, this last step (choosing the best of several given categories) may or may not be difficult.
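To make the last step concrete, here is a hedged sketch of one possible criterion: count how many of an article's categories share a word with each target concept. Everything in it is an assumption for illustration: the category list for 'css3' is made up rather than fetched from Wikipedia, and `rank_concepts` and the keyword lists are hypothetical names and data.

```python
import re


def tokenize(text):
    """Lowercase a label and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_concepts(article_categories, concept_keywords):
    """Rank target concepts by the number of categories sharing a keyword."""
    scores = {}
    for concept, keywords in concept_keywords.items():
        keyword_tokens = set()
        for keyword in keywords:
            keyword_tokens |= tokenize(keyword)
        # A category counts once if any of its words is a keyword.
        scores[concept] = sum(
            1 for category in article_categories
            if tokenize(category) & keyword_tokens
        )
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up category list standing in for what an article lookup might return.
css3_categories = [
    "Cascading Style Sheets",
    "Stylesheet languages",
    "World Wide Web Consortium standards",
    "Web design",
]

# Hypothetical target concepts with hand-picked keywords.
concept_keywords = {
    "Web Design": ["web", "design", "stylesheet", "style"],
    "World News": ["news", "politics", "world"],
}

ranking = rank_concepts(css3_categories, concept_keywords)
print(ranking)
```

Note that 'World News' still picks up one hit from the word "World" in a standards category, which is the kind of noise that can make this step harder than it looks.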

See here for a list of other ontologies / knowledge bases you could use.

ferdystschenko 2010-03-24 17:10:27

@ferdy Oh my god!! I had the idea of using Google API to search for related Wikipedia articles last night, as keywords like 'css3' might give problems. I think I might go with your suggestion, thanks for the very informative answer!

RadiantHex 2010-03-24 17:54:40

Glad I could help :)

ferdystschenko 2010-03-24 17:59:34
