ansaurus

Question

Determining what a word "is" - categorizing a token

Answer 1

+3 A:

Natural language parsing is a complicated topic. One of the problems here is that determining what a word is depends on context and implied knowledge. Also, you're not so much interested in words as you are in groups of words. Consider "New York City": it's a place, but it's three words, two of which ("new" and "city") have other meanings.

You also have to consider ambiguity, which is once again where context and implied knowledge come in. For example, JAVA is (or was) a stock symbol for Sun Microsystems. It's also a programming language, a place and has meaning associated with coffee. How do you classify it? You'd need to know the context in which it was used.

And if you can solve that problem reliably you can make yourself very wealthy.

What's all this in aid of anyway?

cletus 2010-01-28 03:21:28

(+1) For describing, in more detail than I did, why it is a hard problem.

harschware 2010-01-28 03:24:52

The question does not ask for disambiguation. As you see in the examples, it allows multiple categories to be output, so Java would simply be a language, a type of coffee, an island and a stock symbol all at the same time.

Max Shawabkeh 2010-01-28 03:29:50

I'm working to categorise search queries. My research indicates that a high percentage (60%+) of queries are somewhat unambiguous, and if I can properly categorise them then I can present a search engine which skips the search results page in certain cases. The best example of this is a UPS tracking #. The likelihood of someone "searching" for such a number is extremely low. A ticker symbol (if unambiguous) is similar, as are driving directions, addresses, etc. In the case of ambiguity I can simply present regular search results.

Art 2010-01-28 03:32:34
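
A minimal sketch of the routing idea in the comment above: answer directly when a query falls into exactly one well-defined category, and fall back to regular search results otherwise. The tracking-number pattern ("1Z" plus 16 letters or digits), the sample ticker set, the list of ambiguous terms and the name `route_query` are all assumptions made for illustration, not anything from the thread.

```python
import re

# Assumed, simplified format: "1Z" followed by 16 letters or digits.
UPS_TRACKING = re.compile(r"1Z[0-9A-Z]{16}")

# Hypothetical sample of ticker symbols; a real table would be far larger.
TICKERS = {"MSFT", "GOOG", "JAVA"}

# Hypothetical list of symbols that also have an everyday meaning.
AMBIGUOUS_TERMS = {"JAVA"}


def route_query(query):
    """Return a direct-answer category, or "search" for regular results."""
    text = query.strip().upper()
    # Tracking numbers are sometimes typed with spaces, so ignore them here.
    if UPS_TRACKING.fullmatch(text.replace(" ", "")):
        return "tracking_number"
    # Only skip the results page for tickers with no competing meaning.
    if text in TICKERS and text not in AMBIGUOUS_TERMS:
        return "ticker"
    return "search"
```

With these sample tables, `route_query("msft")` is routed as a ticker, while `route_query("java")` falls back to regular search because the term is listed as ambiguous.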

Max S - exactly, a higher-level part of my system will then determine a disambiguation based on the categorisations, if possible.

Art 2010-01-28 03:34:49

Just thought I'd mention that search engines are quite literally in the business of bringing you to the results page ;) (Unless, of course, you're talking about something internal or non-commercial.)

Cogwheel - Matthew Orlando 2010-01-28 03:43:22

Answer 2

+1 A:

You're bumping up against one of the hardest problems in computer science today: determining semantics from English context. This is the classic text mining problem, and it gets into some very advanced topics. I would suggest thinking more about your problem to see if you can a) go without categorization, or b) use structural information, such as document position, to give you a hint (e.g. the token is either a city, a place name, or undetermined), perhaps with some lookup tables to help. For example, it's fairly easy to build a nearly complete lookup table for stock symbols. You might consider downloading the CIA World Factbook as a lookup for cities, etc.

harschware 2010-01-28 03:22:34
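
The lookup-table suggestion could be sketched roughly as follows. It also tries multi-word phrases first, so that a phrase such as "New York City" is matched as a whole, and it reports every category a token belongs to rather than picking one. The table contents, `MAX_PHRASE_WORDS` and the function name `categorize` are hypothetical; real tables would be built from sources such as exchange listings or a gazetteer.

```python
# Hypothetical lookup table mapping a lowercase phrase to all of its categories.
LOOKUP = {
    "java": {"programming language", "island", "coffee", "stock symbol"},
    "new york city": {"place"},
    "new": {"adjective"},
    "city": {"noun"},
}

# Longest phrase (in words) that the table is assumed to contain.
MAX_PHRASE_WORDS = 3


def categorize(text):
    """Greedy longest match: return (phrase, categories) pairs, left to right."""
    words = text.lower().split()
    results = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(MAX_PHRASE_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in LOOKUP:
                results.append((phrase, LOOKUP[phrase]))
                i += n
                break
        else:
            # No table entry starts here; mark the single word as undetermined.
            results.append((words[i], {"undetermined"}))
            i += 1
    return results
```

Greedy longest match is only one possible design choice. It keeps "New York City" together, but it cannot tell when the shorter reading is the intended one, which is where the context problem discussed in this thread comes back in.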

Answer 3

+3 A:

To learn about "tagging" (the term of art for what you're trying to do), I suggest playing around with NLTK's tag module. More generally, NLTK, the Natural Language ToolKit, is an excellent toolkit (based on the Python programming language) for experimentation and learning in the field of Natural Language Processing (whether it's suitable for a given production application may be a different issue, esp. if said application requires very high speed processing on large volumes of data -- but, you have to walk before you can run!-).

Alex Martelli 2010-01-28 03:27:16

thanks for the heads up on the term "tagging"

Art 2010-01-28 03:38:35

Answer 4

+1 A:

As others have already pointed out, this is an exceptionally difficult task. The classic test is a pair of sentences:

Time flies like an arrow.
Fruit flies like a banana.

In the first sentence, "flies" is a verb. In the second, it's part of a noun. In the first, "like" is a preposition, but in the second it's a verb. The context doesn't make this particularly easy to sort out either -- there's no obvious difference between "Time" and "Fruit" (both normally nouns). Likewise, "arrow" and "banana" are both normally nouns.

It can be done -- but it really is decidedly non-trivial.

Jerry Coffin 2010-01-28 03:32:43
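
To see why a per-word lookup cannot separate the two sentences above, here is a small sketch that lists every part of speech each word can take. The mini-lexicon and the name `candidate_tags` are made up for illustration. Both sentences get identical candidates for "flies" and "like", so the choice has to come from the surrounding words.

```python
# Hypothetical mini-lexicon: each word maps to every part of speech it can take.
LEXICON = {
    "time": {"noun", "verb"},
    "fruit": {"noun"},
    "flies": {"noun", "verb"},
    "like": {"verb", "preposition"},
    "an": {"determiner"},
    "a": {"determiner"},
    "arrow": {"noun"},
    "banana": {"noun"},
}


def candidate_tags(sentence):
    """Return (word, sorted candidate tags) for each word, ignoring context."""
    words = sentence.lower().rstrip(".").split()
    return [(w, sorted(LEXICON.get(w, {"unknown"}))) for w in words]


first = candidate_tags("Time flies like an arrow.")
second = candidate_tags("Fruit flies like a banana.")

# The ambiguous words look the same in both sentences at this level.
same_candidates = first[1:3] == second[1:3]
```

Here `same_candidates` is `True`: a context-free lookup returns the same options for "flies" and "like" in both sentences, which is why a tagger needs information about neighbouring words to choose between them.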

Answer 5

+1 A:

Although it might not help you much with disambiguation, you could use Cyc. It's a huge database of what things are that's intended to be used in AI applications (though I haven't heard any success stories).

Max Shawabkeh 2010-01-28 03:33:42
